I'm having a calculation problem with pandas and I'd like to know if anyone could help me.
Given this df, created with the following code:
df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
df
| | B |
|-------------------------|------|
| 2013-01-01 09:31:23.999 | 0.0 |
| 2013-01-01 09:31:24.200 | 2.0 |
| 2013-01-01 09:31:24.250 | 1.0 |
| 2013-01-01 09:31:25.000 | NaN |
| 2013-01-01 09:31:25.375 | 4.0 |
| 2013-01-01 09:31:25.850 | 1.0 |
| 2013-01-01 09:31:26.100 | 3.0 |
| 2013-01-01 09:31:27.150 | 10.0 |
| 2013-01-01 09:31:28.050 | NaN |
| 2013-01-01 09:31:28.850 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 |
For each row, I would like to calculate the maximum variation of B during the following second.
For example, for the first row you look at the rows whose timestamps fall within one second after it (here the second and third rows), take the maximum value of B among them, and subtract the current row's value.
In this case the maximum value is in the second row (09:31:24.200), so the maximum variation is 2 - 0 = 2.
The result goes into a new column holding the maximum variation for each row.
df
| | B | Maximum Variation |
|-------------------------|------|--------------------|
| 2013-01-01 09:31:23.999 | 0.0 | 2.0 |
| 2013-01-01 09:31:24.200 | 2.0 | 1.0 |
| 2013-01-01 09:31:24.250 | 1.0 | 0.0 |
| 2013-01-01 09:31:25.000 | NaN | 4.0 |
| 2013-01-01 09:31:25.375 | 4.0 |-3.0 |
| 2013-01-01 09:31:25.850 | 1.0 | 2.0 |
| 2013-01-01 09:31:26.100 | 3.0 | 0.0 |
| 2013-01-01 09:31:27.150 | 10.0 | 0.0 |
| 2013-01-01 09:31:28.050 | NaN | 3.0 |
| 2013-01-01 09:31:28.850 | 3.0 | 3.0 |
| 2013-01-01 09:31:29.200 | 6.0 | 0.0 |
I hope it's clear enough
A solution has been found and shared in the answers, but an efficiency improvement that avoids looping over every row of the df would still be more than welcome.
I've finally found the solution:
df = pd.DataFrame({'B': [0, 1, 2, 8, 6, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
for index, row in df.iterrows():
    start = row['start_date']
    end = row['end_date']
    maxi = df[(df['start_date'] >= start) & (df['start_date'] <= end)]['B'].max()
    df.iloc[index, df.columns.get_loc('max')] = maxi
df['Maximum Variation'] = df['max'] - df['B']
df
| | start_date | B | duration_in_seconds | end_date | max | Maximum Variation |
|----|-------------------------|------|---------------------|-------------------------|------|-------------------|
| 0 | 2013-01-01 09:31:23.999 | 0.0 | 1 | 2013-01-01 09:31:24.999 | 2.0 | 2.0 |
| 1 | 2013-01-01 09:31:24.200 | 1.0 | 1 | 2013-01-01 09:31:25.200 | 8.0 | 7.0 |
| 2 | 2013-01-01 09:31:24.250 | 2.0 | 1 | 2013-01-01 09:31:25.250 | 8.0 | 6.0 |
| 3 | 2013-01-01 09:31:25.000 | 8.0 | 1 | 2013-01-01 09:31:26.000 | 8.0 | 0.0 |
| 4 | 2013-01-01 09:31:25.375 | 6.0 | 1 | 2013-01-01 09:31:26.375 | 6.0 | 0.0 |
| 5 | 2013-01-01 09:31:25.850 | 1.0 | 1 | 2013-01-01 09:31:26.850 | 3.0 | 2.0 |
| 6 | 2013-01-01 09:31:26.100 | 3.0 | 1 | 2013-01-01 09:31:27.100 | 3.0 | 0.0 |
| 7 | 2013-01-01 09:31:27.150 | 10.0 | 1 | 2013-01-01 09:31:28.150 | 10.0 | 0.0 |
| 8 | 2013-01-01 09:31:28.050 | NaN | 1 | 2013-01-01 09:31:29.050 | 3.0 | NaN |
| 9 | 2013-01-01 09:31:28.850 | 3.0 | 1 | 2013-01-01 09:31:29.850 | 6.0 | 3.0 |
| 10 | 2013-01-01 09:31:29.200 | 6.0 | 1 | 2013-01-01 09:31:30.200 | 6.0 | 0.0 |
More time efficient solutions are still welcome
More efficient solution
df = df.reset_index()
df = df.rename(columns={"index": "start_date"})
df['duration_in_seconds'] = 1
df['end_date'] = df['start_date'] + pd.to_timedelta(df['duration_in_seconds'], unit='s')
df['max'] = np.nan
df["max"] = df.apply(lambda row : df.loc[(df["start_date"] >= row['start_date']) & (df["start_date"] <=row['end_date'])]["B"].max(), axis = 1)
df['Maximum Variation'] = df['max'] - df['B']
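Both versions above still filter the whole frame once per row. As a rough, loop-free sketch (not part of the original answers), the forward-looking window can be turned into a backward-looking one by mirroring the timestamps, so that pandas' time-based `rolling` can compute the maximum. The `anchor` timestamp and the names `mirrored`, `fwd_max` and `variation` are made up for illustration; `closed='both'` is meant to mimic the `>= start` and `<= end` comparison used in the loop.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime([
    '2013-01-01 09:31:23.999', '2013-01-01 09:31:24.200', '2013-01-01 09:31:24.250',
    '2013-01-01 09:31:25.000', '2013-01-01 09:31:25.375', '2013-01-01 09:31:25.850',
    '2013-01-01 09:31:26.100', '2013-01-01 09:31:27.150', '2013-01-01 09:31:28.050',
    '2013-01-01 09:31:28.850', '2013-01-01 09:31:29.200',
])
s = pd.Series([0, 1, 2, 8, 6, 1, 3, 10, np.nan, 3, 6], index=times, name='B')

# Mirror time around the last timestamp, so "the next second" becomes
# "the previous second" on an increasing index.
anchor = pd.Timestamp('2000-01-01')  # arbitrary origin, only used to build a DatetimeIndex
mirrored_index = anchor + (s.index.max() - s.index)[::-1]
mirrored = pd.Series(s.to_numpy()[::-1], index=mirrored_index)

# Backward 1-second window (both ends included) on mirrored time,
# then flip the result back into the original row order.
fwd_max = mirrored.rolling('1s', closed='both').max().to_numpy()[::-1]
variation = fwd_max - s.to_numpy()
print(variation)
```

On the sample data this should reproduce the `max` and `Maximum Variation` columns of the loop-based solution, including the NaN for the row where B itself is NaN.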
import numpy as np
import pandas as pd
df = pd.DataFrame({'B': [0, 2, 1, np.nan, 4, 1, 3, 10, np.nan, 3, 6]},
index = [pd.Timestamp('20130101 09:31:23.999'),
pd.Timestamp('20130101 09:31:24.200'),
pd.Timestamp('20130101 09:31:24.250'),
pd.Timestamp('20130101 09:31:25.000'),
pd.Timestamp('20130101 09:31:25.375'),
pd.Timestamp('20130101 09:31:25.850'),
pd.Timestamp('20130101 09:31:26.100'),
pd.Timestamp('20130101 09:31:27.150'),
pd.Timestamp('20130101 09:31:28.050'),
pd.Timestamp('20130101 09:31:28.850'),
pd.Timestamp('20130101 09:31:29.200')])
print(df)
B
2013-01-01 09:31:23.999 0.0
2013-01-01 09:31:24.200 2.0
2013-01-01 09:31:24.250 1.0
2013-01-01 09:31:25.000 NaN
2013-01-01 09:31:25.375 4.0
2013-01-01 09:31:25.850 1.0
2013-01-01 09:31:26.100 3.0
2013-01-01 09:31:27.150 10.0
2013-01-01 09:31:28.050 NaN
2013-01-01 09:31:28.850 3.0
2013-01-01 09:31:29.200 6.0
df_min = df.resample('1S').min()
print(df_min)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 1.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_max = df.resample('1S').max()
print(df_max)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 2.0
2013-01-01 09:31:25 4.0
2013-01-01 09:31:26 3.0
2013-01-01 09:31:27 10.0
2013-01-01 09:31:28 3.0
2013-01-01 09:31:29 6.0
df_diff = df_max - df_min
print(df_diff)
B
2013-01-01 09:31:23 0.0
2013-01-01 09:31:24 1.0
2013-01-01 09:31:25 3.0
2013-01-01 09:31:26 0.0
2013-01-01 09:31:27 0.0
2013-01-01 09:31:28 0.0
2013-01-01 09:31:29 0.0
I have a set of data frames df1, df2, ... dfn
Each df looks like:
id | date | metric_value
001 | 2013-01-01 | 0.73
001 | 2013-03-01 | 0.73
002 | 2013-01-01 | 0.73
002 | 2013-02-01 | 0.73
But there is not necessarily a match between the id and date column, so I could have a df1 like:
id | date | metric_value1
001 | 2013-01-01 | 0.73
001 | 2013-03-01 | 0.73
002 | 2013-01-01 | 0.73
002 | 2013-02-01 | 0.73
004 | 2013-03-01 | 0.73
And a df2 like:
id | date | metric_value2
001 | 2013-01-01 | 0.72
003 | 2013-02-01 | 0.72
003 | 2013-03-01 | 0.72
004 | 2013-01-01 | 0.72
How could I merge df1 and df2, generally speaking df1 ... dfn, so I could have something like:
id | date | metric_value1 | metric_value2
001 | 2013-01-01 | 0.73 | 0.72
001 | 2013-02-01 | Nan | Nan
001 | 2013-03-01 | 0.73 | Nan
002 | 2013-01-01 | 0.73 | Nan
002 | 2013-02-01 | 0.73 | Nan
002 | 2013-03-01 | Nan | Nan
003 | 2013-01-01 | Nan | Nan
003 | 2013-02-01 | Nan | 0.72
003 | 2013-03-01 | Nan | 0.72
004 | 2013-01-01 | Nan | 0.72
004 | 2013-02-01 | Nan | Nan
004 | 2013-03-01 | 0.73 | Nan
The result should cover every id over the entire range of dates, from the minimum date to the maximum date.
Taking @JonathanLeon's solution a little further:
import io
import pandas as pd
data='''id|date|metric_value1
001|2013-01-01|0.73
001|2013-03-01|0.73
002|2013-01-01|0.73
002|2013-02-01|0.73
004|2013-03-01|0.73'''
df1 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
data='''id|date|metric_value2
001|2013-01-01|0.72
003|2013-02-01|0.72
003|2013-03-01|0.72
004|2013-01-01|0.72'''
df2 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
df_out = df1.merge(df2, on=['id', 'date'], how='outer')
df_out['date'] = pd.to_datetime(df_out['date'])
df_out.set_index(['id', 'date'])\
      .reindex(pd.MultiIndex.from_product([df_out['id'].unique(),
                                           df_out['date'].unique()],
                                          names=['id', 'date']))\
      .sort_index()\
      .reset_index()
Output:
id date metric_value1 metric_value2
0 1 2013-01-01 0.73 0.72
1 1 2013-02-01 NaN NaN
2 1 2013-03-01 0.73 NaN
3 2 2013-01-01 0.73 NaN
4 2 2013-02-01 0.73 NaN
5 2 2013-03-01 NaN NaN
6 3 2013-01-01 NaN NaN
7 3 2013-02-01 NaN 0.72
8 3 2013-03-01 NaN 0.72
9 4 2013-01-01 NaN 0.72
10 4 2013-02-01 NaN NaN
11 4 2013-03-01 0.73 NaN
Try:
data='''id|date|metric_value1
001|2013-01-01|0.73
001|2013-03-01|0.73
002|2013-01-01|0.73
002|2013-02-01|0.73
004|2013-03-01|0.73'''
df1 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
data='''id|date|metric_value2
001|2013-01-01|0.72
003|2013-02-01|0.72
003|2013-03-01|0.72
004|2013-01-01|0.72'''
df2 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
df1.merge(df2, on=['id', 'date'], how='outer')
Output:
id date metric_value1 metric_value2
0 1 2013-01-01 0.730 0.720
1 1 2013-03-01 0.730 NaN
2 2 2013-01-01 0.730 NaN
3 2 2013-02-01 0.730 NaN
4 4 2013-03-01 0.730 NaN
5 3 2013-02-01 NaN 0.720
6 3 2013-03-01 NaN 0.720
7 4 2013-01-01 NaN 0.720
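For the general case df1 ... dfn, and to fill in dates that appear in none of the frames, one possible sketch is to fold the frames together with `functools.reduce` and then reindex on the full id x date grid. The list name `frames`, the variable names, and the monthly frequency `'MS'` (month start, guessed from the sample dates) are assumptions, not something stated in the question.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 2, 4],
    'date': pd.to_datetime(['2013-01-01', '2013-03-01', '2013-01-01',
                            '2013-02-01', '2013-03-01']),
    'metric_value1': [0.73] * 5,
})
df2 = pd.DataFrame({
    'id': [1, 3, 3, 4],
    'date': pd.to_datetime(['2013-01-01', '2013-02-01', '2013-03-01', '2013-01-01']),
    'metric_value2': [0.72] * 4,
})
frames = [df1, df2]  # extend with df3 ... dfn

# Outer-merge all frames pairwise on the key columns.
merged = reduce(
    lambda left, right: left.merge(right, on=['id', 'date'], how='outer'),
    frames,
)

# Build the full grid: every id crossed with every month from min to max date.
all_ids = sorted(merged['id'].unique())
all_dates = pd.date_range(merged['date'].min(), merged['date'].max(), freq='MS')
full_index = pd.MultiIndex.from_product([all_ids, all_dates], names=['id', 'date'])

result = merged.set_index(['id', 'date']).reindex(full_index).reset_index()
print(result)
```

Using `pd.date_range` instead of the unique dates found in the data means a month that is missing from every frame still gets a row.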
import pandas
import datetime
#build your list of unique ids
ids = pandas.concat([df1['id'], df2['id']])
ids = pandas.Series(ids.unique())
#can do as above to get all possible dates, I've just generated them.
dates = pandas.DataFrame(pandas.date_range(datetime.date.today(), freq='D', periods = 10), columns=['date'])
#use a cross merge (pandas >= 1.2) to generate the cartesian product of all dates and all ids
combinations = pandas.merge(left=dates, right=pandas.DataFrame(ids.unique(), columns=['id']), how='cross')
#merge your dataframes on your 'key' columns, starting from the full set of combinations
#(the 'date' columns of df1 and df2 must be datetime64 for the keys to match)
df3 = pandas.merge(left=combinations, right=df1, on=['date', 'id'], how='left')
df4 = pandas.merge(left=combinations, right=df2, on=['date', 'id'], how='left')
I have a file that looks like this:
Date | col1 | col2 | col3
2010-01-01 | -1.4 | 0.0 | 0.0
2010-01-01 | -1.4 | 0.0 | 0.0
2010-01-01 | -2.4 | 0.0 | 0.66
2010-01-02 | -2.4 | 0.0 | 0.08
2010-01-02 | -4.3 | 0.0 | 0.1
2010-01-02 | -4.3 | 0.0 | 1.04
Rows with the same date refer to different cities, so for 2010-01-01 there is data for 3 cities, and the same holds for 2010-01-02 and all other days (the count is always the same; at the moment it is 13 cities = 13 rows per day).
The city names are in a list, in the same order as the rows within each day:
["city1", "city2", "city3"]
So "city1" is the first row for each day, then "city2", then "city3" and so on.
I need to convert this into a standard format where I can set the Date as the index, like this:
Date | city1_col1 | city1_col2 | city1_col3 | city2_col1| city2_col2 | city2_col3 | city3_col1| city3_col2 | city3_col3
2010-01-01 | -1.4 | 0.0 | 0.0 | -1.4 | 0.0 | 0.0 | -2.4 | 0.0 | 0.66
2010-01-02 | -2.4 | 0.0 | 0.08 | -4.3 | 0.0 | 0.1 | -4.3 | 0.0 | 1.04
The data is later merged with other dataframes where the indexes are also the days of the year so a multiindex won't work.
How can I achieve this with pandas?
Here's a way to do that:
df["city"] = cities * (len(df) // len(cities))
df = pd.pivot_table(df, index="Date", columns="city")
df.columns = [c[1] + "_" + c[0] for c in df.columns]
df=df.sort_index(axis=1)
The output is:
city1_col1 city1_col2 city1_col3 city2_col1 city2_col2 city2_col3 city3_col1 city3_col2 city3_col3
Date
2010-01-01 -1.4 0.0 0.00 -1.4 0.0 0.0 -2.4 0.0 0.66
2010-01-02 -2.4 0.0 0.08 -4.3 0.0 0.1 -4.3 0.0 1.04
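The approach above relies on `len(df)` being an exact multiple of `len(cities)`, and `pivot_table` aggregates (by mean) if a date/city pair ever repeats. A slightly more defensive sketch, under the same assumption that rows within each day follow the order of the city list, labels each row by its position within its day using `groupby(...).cumcount()`. The sample frame and the name `wide` are only for illustration.

```python
import pandas as pd

cities = ["city1", "city2", "city3"]
df = pd.DataFrame({
    "Date": ["2010-01-01"] * 3 + ["2010-01-02"] * 3,
    "col1": [-1.4, -1.4, -2.4, -2.4, -4.3, -4.3],
    "col2": [0.0] * 6,
    "col3": [0.0, 0.0, 0.66, 0.08, 0.1, 1.04],
})

# Position of each row within its day (0, 1, 2, ...) mapped to the city name.
df["city"] = df.groupby("Date").cumcount().map(dict(enumerate(cities)))

# One row per date, one column per (original column, city) pair.
wide = df.set_index(["Date", "city"]).unstack("city")

# Flatten the two-level column labels to "<city>_<col>".
wide.columns = [f"{city}_{col}" for col, city in wide.columns]
wide = wide.sort_index(axis=1)
print(wide)
```

Because the result has a plain `Date` index and flat column names, it can be merged with other date-indexed frames without a MultiIndex.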
I have a pivoted pandas dataframe that looks like the one below.
I need to unpivot it into a dataframe indexed by datetime, and the variables (columns) reduced to only one of each.
I tried using melt but I am struggling to reshape it because of the hour row.
What would be the best option to reshape such a dataframe?
The dataframe I have
+----------+------+------+------+------+------+
| nan | var1 | var1 | var2 | var2 | var3 |
+----------+------+------+------+------+------+
| Hour | 2 | 3 | 0 | 2 | 0 |
| 1/1/2019 | 0.8 | 0.4 | 0.6 | 0.9 | 0.7 |
| 1/2/2019 | 0.2 | 0.2 | 0.7 | 0.3 | 0.1 |
| 1/3/2019 | 0.1 | 0.0 | 0.3 | 0.4 | 1.0 |
+----------+------+------+------+------+------+
The dataframe I need to get
+---------------+------+------+------+
| Datetime | var1 | var2 | var3 |
+---------------+------+------+------+
| 1/1/2019 0:00 | NaN | 0.6 | 0.7 |
| 1/1/2019 1:00 | NaN | NaN | NaN |
| 1/1/2019 2:00 | 0.8 | 0.9 | NaN |
| 1/1/2019 3:00 | 0.4 | NaN | NaN |
| 1/2/2019 0:00 | NaN | 0.7 | 0.1 |
| 1/2/2019 1:00 | NaN | NaN | NaN |
| 1/2/2019 2:00 | 0.2 | 0.3 | NaN |
| 1/2/2019 3:00 | 0.2 | NaN | NaN |
| 1/3/2019 0:00 | NaN | 0.3 | 1.0 |
| 1/3/2019 1:00 | NaN | NaN | NaN |
| 1/3/2019 2:00 | 0.1 | 0.4 | NaN |
| 1/3/2019 3:00 | 0.0 | NaN | NaN |
+---------------+------+------+------+
Here's a really shitty answer that is unidiomatic pandas, but it gets the job done given the data you presented, in the format you presented it in. If you have a massive amount of data, I highly recommend you find a more optimized way.
dff = df.copy()
mn, mx = df.loc['Hour'].agg([min, max]).astype(int)
df = df.loc[df.index.repeat(mx-mn+1)]
df = df.loc[df.index != 'Hour']
# repeat the hour range once per date (len(df) is n_dates * n_hours at this point)
df = df.assign(time=list(range(mn,mx+1))*(len(df)//(mx-mn+1)))
df = df.set_index('time', append=True).iloc[:,:0]
for i,v in enumerate(dff.columns):
    d = dff.iloc[:, i].to_frame()
    hour = d.at['Hour', v]
    for idx, row in d.iloc[1:].iterrows():
        df.loc[(idx, hour), v] = row[v]
df = df.reset_index().rename(columns={0: 'date'})
df['datetime'] = df[['date', 'time']].apply(lambda x: f"{x['date']} {x['time']}:00", axis=1)
df = df.drop(columns=['date', 'time']).set_index('datetime').reset_index()
print(df)
datetime v1 v2 v3
0 1/1/2019 0:00 NaN 0.6 0.7
1 1/1/2019 1:00 NaN NaN NaN
2 1/1/2019 2:00 0.8 0.9 NaN
3 1/1/2019 3:00 0.4 NaN NaN
4 1/2/2019 0:00 NaN 0.7 0.1
5 1/2/2019 1:00 NaN NaN NaN
6 1/2/2019 2:00 0.2 0.3 NaN
7 1/2/2019 3:00 0.2 NaN NaN
8 1/3/2019 0:00 NaN 0.3 1.0
9 1/3/2019 1:00 NaN NaN NaN
10 1/3/2019 2:00 0.1 0.4 NaN
11 1/3/2019 3:00 0.0 NaN NaN
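For comparison, here is a sketch that avoids the row-by-row assignment: each (variable, hour) column becomes a small long-format frame, the pieces are concatenated and pivoted, and the result is reindexed on the full hourly grid. It assumes the frame is laid out as in the question (an `Hour` row followed by date rows, dates in month/day/year order); the names `raw`, `pieces`, `wide` and `full` are illustrative.

```python
import pandas as pd

raw = pd.DataFrame(
    [[2, 3, 0, 2, 0],
     [0.8, 0.4, 0.6, 0.9, 0.7],
     [0.2, 0.2, 0.7, 0.3, 0.1],
     [0.1, 0.0, 0.3, 0.4, 1.0]],
    index=['Hour', '1/1/2019', '1/2/2019', '1/3/2019'],
    columns=['var1', 'var1', 'var2', 'var2', 'var3'],
)

hours = raw.loc['Hour'].astype(int).to_numpy()
values = raw.drop(index='Hour')
dates = pd.to_datetime(values.index, format='%m/%d/%Y')

# One long-format piece per original column: date + hour, variable name, value.
pieces = [
    pd.DataFrame({
        'Datetime': dates + pd.Timedelta(hours=int(hour)),
        'var': var,
        'value': values.iloc[:, pos].to_numpy(),
    })
    for pos, (var, hour) in enumerate(zip(raw.columns, hours))
]
wide = pd.concat(pieces).pivot(index='Datetime', columns='var', values='value')

# Fill in the hours that no column covers (here: hour 1) with NaN rows.
full = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h)
     for d in dates
     for h in range(hours.min(), hours.max() + 1)],
    name='Datetime',
)
wide = wide.reindex(full)
print(wide)
```

The loop here runs once per column rather than once per cell, which should matter when there are many dates.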
I am trying to create 3 new columns in a dataframe, which are based on previous pairs information.
You can think of the dataframe as the results of a competition ('xx' column) between different types ('type' column) at different dates ('date' column).
The idea is to create the following new columns:
(i) numb_comp_past: sum of the number of times every type faced the competitors in the past.
(ii) win_comp_past: sum of the win (+1), ties (+0), and loss (-1) of the previous competitions that all the types competing with each other had in the past.
(iii) win_comp_past_difs: sum of difference of the results of the previous competitions that all the types competing with each other had in the past.
The original dataframe (df) is the following:
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')
And it looks like this:
xx
date type
2018-01-01 A 1.0
B 5.0
2018-02-01 B 3.0
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
2018-05-01 B 3.0
2018-06-01 A 5.0
B 2.0
C 3.0
2018-07-01 A 1.0
2018-08-01 B 9.0
C 3.0
2018-09-01 A 2.0
B 7.0
2018-10-01 C 3.0
A 6.0
B 8.0
2018-11-01 A 2.0
2018-12-01 B 7.0
C 9.0
The 3 new columns I need to add to the dataframe are shown below (expected output of the Pandas code):
xx numb_comp_past win_comp_past win_comp_past_difs
date type
2018-01-01 A 1.0 0.0 0.0 0.0
B 5.0 0.0 0.0 0.0
2018-02-01 B 3.0 0.0 0.0 0.0
2018-03-01 A 2.0 1.0 -1.0 -4.0
B 7.0 1.0 1.0 4.0
C 3.0 0.0 0.0 0.0
D 1.0 0.0 0.0 0.0
E 6.0 0.0 0.0 0.0
2018-05-01 B 3.0 0.0 0.0 0.0
2018-06-01 A 5.0 3.0 -3.0 -10.0
B 2.0 3.0 3.0 13.0
C 3.0 2.0 0.0 -3.0
2018-07-01 A 1.0 0.0 0.0 0.0
2018-08-01 B 9.0 2.0 0.0 3.0
C 3.0 2.0 0.0 -3.0
2018-09-01 A 2.0 3.0 -1.0 -6.0
B 7.0 3.0 1.0 6.0
2018-10-01 C 3.0 5.0 -1.0 -10.0
A 6.0 6.0 -2.0 -10.0
B 8.0 7.0 3.0 20.0
2018-11-01 A 2.0 0.0 0.0 0.0
2018-12-01 B 7.0 4.0 2.0 14.0
C 9.0 4.0 -2.0 -14.0
Note that:
(i) for numb_comp_past if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, the type A has a value of 3 given that he previously competed with type B on 2018-01-01 and 2018-03-01 and with type C on 2018-03-01.
(ii) for win_comp_past if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, the type A has a value of -3 given that he previously lost with type B on 2018-01-01 (-1) and 2018-03-01 (-1) and with type C on 2018-03-01 (-1). Thus adding -1-1-1=-3.
(iii) for win_comp_past_difs if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, the type A has a value of -10 given that he previously lost with type B on 2018-01-01 by a difference of -4 (=1-5) and on 2018-03-01 by a difference of -5 (=2-7) and with type C on 2018-03-01 by -1 (=2-3). Thus adding -4-5-1=-10.
I really don't know how to start solving this problem. Any ideas of how to compute the new columns described in (i), (ii) and (iii) are very welcome.
Here's my take:
# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    # use the underlying NumPy array: 2-D indexing on a Series is not supported in recent pandas
    tmp = pd.DataFrame(x.values[:,None]-x.values[None,:],
                       columns = teams.values,
                       index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0)!=tmp.index.get_level_values(1)]
new_df = df.groupby('date').xx.apply(get_diff).to_frame()
# win matches
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)
# group by players
groups = new_df.groupby(level=[1,2])
# sum function
def cumsum_shift(x):
    return x.cumsum().shift()
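# assign new values
# (Series.sum(level=...) was removed in pandas 2.0, so group by the index levels instead;
# transform keeps the original (date, type, type) index regardless of pandas version)
df['num_comp_past'] = groups.xx.cumcount().groupby(level=[0,1]).sum()
df['win_comp_past'] = groups.win.transform(cumsum_shift).groupby(level=[0,1]).sum()
df['win_comp_past_difs'] = groups.xx.transform(cumsum_shift).groupby(level=[0,1]).sum()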
Output:
+------------+------+-----+---------------+---------------+--------------------+
| | | xx | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date | type | | | | |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A | 1.0 | 0.0 | 0.0 | 0.0 |
| | B | 5.0 | 0.0 | 0.0 | 0.0 |
| 2018-02-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-03-01 | A | 2.0 | 1.0 | -1.0 | -4.0 |
| | B | 7.0 | 1.0 | 1.0 | 4.0 |
| | C | 3.0 | 0.0 | 0.0 | 0.0 |
| | D | 1.0 | 0.0 | 0.0 | 0.0 |
| | E | 6.0 | 0.0 | 0.0 | 0.0 |
| 2018-05-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-06-01 | A | 5.0 | 3.0 | -3.0 | -10.0 |
| | B | 2.0 | 3.0 | 3.0 | 13.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-07-01 | A | 1.0 | NaN | NaN | NaN |
| 2018-08-01 | B | 9.0 | 2.0 | 0.0 | 3.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-09-01 | A | 2.0 | 3.0 | -1.0 | -6.0 |
| | B | 7.0 | 3.0 | 1.0 | 6.0 |
| 2018-10-01 | C | 3.0 | 5.0 | -1.0 | -10.0 |
| | A | 6.0 | 6.0 | -2.0 | -10.0 |
| | B | 8.0 | 7.0 | 3.0 | 20.0 |
| 2018-11-01 | A | 2.0 | NaN | NaN | NaN |
| 2018-12-01 | B | 7.0 | 4.0 | 2.0 | 14.0 |
| | C | 9.0 | 4.0 | -2.0 | -14.0 |
+------------+------+-----+---------------+---------------+--------------------+
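Note that the output above has NaN on dates with a single competitor, where the question expects 0. As a cross-check, here is a plain-Python sketch (deliberately not vectorized, and not the method of the answer above) that keeps a running head-to-head record per ordered pair and reads it before recording the current date. It uses only the first twelve rows of the question's data; the names `history`, `records`, `stats` and `out` are illustrative.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(
        ['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18',
         'Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18'],
        format='%b-%y'),
    'type': list('ABBABCDEBABC'),
    'xx': [1., 5., 3., 2., 7., 3., 1., 6., 3., 5., 2., 3.],
}).set_index(['date', 'type'])

# (a, b) -> [meetings so far, net wins of a over b, summed score difference a - b]
history = defaultdict(lambda: np.zeros(3))
records = {}

for date, grp in df.groupby(level='date'):
    scores = dict(zip(grp.index.get_level_values('type'), grp['xx']))
    # Read the history first, so only strictly earlier dates are counted.
    for a in scores:
        records[(date, a)] = sum(
            (history[(a, b)] for b in scores if b != a), np.zeros(3))
    # Then record today's pairings for use on later dates.
    for a, xa in scores.items():
        for b, xb in scores.items():
            if a != b:
                history[(a, b)] += [1, np.sign(xa - xb), xa - xb]

stats = pd.DataFrame(
    list(records.values()),
    index=pd.MultiIndex.from_tuples(list(records), names=['date', 'type']),
    columns=['numb_comp_past', 'win_comp_past', 'win_comp_past_difs'],
)
out = df.join(stats)
print(out)
```

A type that competes alone on a date has no opponents to sum over, so it gets 0 in all three columns without any extra `fillna` step.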
I would like to add columns to a time-indexed pandas DataFrame which contain the rate of change over the last n hours for each of the existing columns. I have accomplished this with the following code, however, it is too slow for my needs (probably due to looping over every index of each column?).
Is there a (computationally) faster way to do this?
roc_hours = 12
tol = 1e-10
for c in ts.columns:
    c_roc = c + ' +++ RoC ' + str(roc_hours) + 'h'
    ts[c_roc] = np.nan
    for i in ts.index[np.isfinite(ts[c])]:
        df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]
        # nanoseconds since the start of the window, converted to hours
        X = (df.index.values - df.index.values.min()).astype('int64')*2.77778e-13
        Y = df.values
        if Y.std() > tol and X.shape[0] > 1:
            fit = np.polyfit(X,Y,1)
            ts.loc[i, c_roc] = fit[0]
        else:
            ts.loc[i, c_roc] = 0
Edit: the input dataframe ts is irregularly sampled and can contain NaNs. First few lines of input ts:
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| WCT | a | b | c | d | e | f |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
| 2011-09-04 20:00:00 | | | | | | |
| 2011-09-04 21:00:00 | | | | | | |
| 2011-09-04 22:00:00 | | | | | | |
| 2011-09-04 23:00:00 | | | | | | |
| 2011-09-05 02:00:00 | 93.0 | 97.0 | 20.0 | 209.0 | 85.0 | 98.0 |
| 2011-09-05 03:00:00 | 74.14285714285714 | 97.0 | 20.0 | 194.14285714285717 | 74.42857142857143 | 98.0 |
| 2011-09-05 04:00:00 | 67.5 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 05:00:00 | 72.0 | 98.5 | 20.0 | 176.0 | 75.0 | 98.0 |
| 2011-09-05 07:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 08:00:00 | 80.0 | 93.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 09:00:00 | 78.5 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 10:00:00 | 73.0 | 98.0 | 19.0 | 186.0 | 71.0 | 97.0 |
| 2011-09-05 11:00:00 | 77.0 | 98.0 | 18.0 | 175.0 | 87.0 | 97.0999984741211 |
| 2011-09-05 12:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
| 2011-09-05 15:00:00 | 78.0 | 98.0 | 19.0 | 163.0 | 57.0 | 98.4000015258789 |
+---------------------+-------------------+------+------+--------------------+-------------------+------------------+
Edit 2
After profiling, the bottleneck is in the slicing step: df = ts[c][i - np.timedelta64(roc_hours, 'h'):i]. This line pulls out observations time-stamped between now-roc_hours and now. It's very handy syntax, but is taking up the bulk of the compute time.
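One way to avoid the per-timestamp slicing altogether is to replace `np.polyfit` with the closed-form least-squares slope, built from time-based rolling sums. The sketch below is only that, a sketch: the helper name `rolling_slope` is made up, `closed='both'` is an attempt to match the inclusive label slicing in the loop, and windows with fewer than two points get 0 rather than reproducing the `Y.std() > tol` test exactly (a constant window yields a slope of about 0 anyway).

```python
import numpy as np
import pandas as pd

def rolling_slope(s, window='12h'):
    """Least-squares slope (per hour) of s over a trailing time window."""
    s = s.dropna()
    # x = hours elapsed since the first valid observation
    hours = ((s.index - s.index[0]) / pd.Timedelta(hours=1)).to_numpy()
    x = pd.Series(hours, index=s.index)

    def rsum(v):
        return v.rolling(window, closed='both').sum()

    n = s.rolling(window, closed='both').count()
    sx, sy, sxy, sxx = rsum(x), rsum(s), rsum(x * s), rsum(x * x)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    # A single point has no defined slope; use 0 as in the original loop.
    return slope.where(n > 1, 0.0)

idx = pd.to_datetime(['2011-09-05 02:00', '2011-09-05 03:00', '2011-09-05 04:00',
                      '2011-09-05 05:00', '2011-09-05 07:00'])
a = pd.Series([93.0, 74.0, 67.5, 72.0, 80.0], index=idx)
roc = rolling_slope(a, '12h')
print(roc)
```

Applied per column, something like `ts[c_roc] = rolling_slope(ts[c])` would leave NaN at timestamps where the column itself is NaN, because the result is indexed by the valid rows only. The sums-of-products form can lose precision on very long series with large values, so it is worth comparing against `np.polyfit` on a sample before relying on it.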
Works on a dataset of mine, haven't checked on yours:
import pandas as pd
from numpy import polyfit
from matplotlib import style
style.use('ggplot')
# ... acquire a dataframe named *water* with a column *value*
WINDOW = 10
ax=water.value.plot()
roll = water.value.rolling(WINDOW).mean()  # pd.rolling_mean was removed from pandas
roll.plot(ax=ax)
def lintrend(df):
    df = df.tolist()
    m, b = polyfit(range(len(df)), df,1)
    return m
linny = water.value.rolling(WINDOW).apply(lintrend, raw=True)  # replaces pd.rolling_apply
linny.plot(ax=ax)
Casting the numpy.ndarray to list after rolling_apply cast it to numpy.ndarray seems inelegant. Suggestions?
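One possible answer to that last question: with the current `.rolling(...).apply(..., raw=True)` API the function receives a NumPy array, and `np.polyfit` accepts arrays directly, so the list conversion can simply be dropped. A minimal sketch, using a synthetic `water` frame as a stand-in for the real data:

```python
import numpy as np
import pandas as pd

WINDOW = 10
# Synthetic stand-in for the *water* dataframe: a perfectly linear series.
water = pd.DataFrame({'value': 2.0 * np.arange(30) + 1.0})

def lintrend(window):
    # With raw=True, `window` is already a NumPy array; no list conversion needed.
    slope, _intercept = np.polyfit(np.arange(len(window)), window, 1)
    return slope

linny = water['value'].rolling(WINDOW).apply(lintrend, raw=True)
print(linny.tail())
```

Note that this is a fixed-count window of `WINDOW` rows, so the slope is per row rather than per unit of time.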