I have a dataset that looks as follows:
userid time val1 val2 val3 val4
1 2010-6-1 0:15 12 16 17 11
1 2010-6-1 0:30 11.5 14 15.2 10
1 2010-6-1 0:45 12 14 15 10
1 2010-6-1 1:00 8 11 13 0
.................................
.................................
2 2010-6-1 0:15 14 16 17 11
2 2010-6-1 0:30 11 14 15.2 10
2 2010-6-1 0:45 11 14 15 10
2 2010-6-1 1:00 9 11 13 0
.................................
.................................
3 ...................................
.................................
.................................
I want to get the average of every two rows. Expected results would be
userid time val1 val2 val3 val4
1 2010-6-1 0:30 11.75 15 16.1 10.5
1 2010-6-1 1:00 10 12.5 14 5
..............................
..............................
2 2010-6-1 0:30 12.5 15 16.1 10.5
2 2010-6-1 1:00 10 12.5 14 5
.................................
.................................
3 ...................................
.................................
.................................
At the moment my approach is
data = pd.read_csv("sample_dataset.csv")
i = 0
while i < len(data) - 1:
    x = data.iloc[i:i+2].mean()
    x['time'] = data.iloc[i+1]['time']
    data.iloc[i] = x
    i += 2
# keep only the rows that now hold the pair averages
data = data.iloc[::2].reset_index(drop=True)
But this is very inefficient. Can someone point me to a better approach to get the intended result? The dataset has more than 1,000,000 rows.
Method 1: using resample
df.set_index('time').resample('30Min', closed='right', label='right').mean()
Out[293]:
val1 val2 val3 val4
time
2010-06-01 00:30:00 11.75 15.0 16.1 10.5
2010-06-01 01:00:00 10.00 12.5 14.0 5.0
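As written, that call drops userid. A minimal sketch (the small sample frame below is hypothetical) that keeps the per-user grouping by resampling inside groupby:
import pandas as pd
# Hypothetical sample shaped like the question's data.
df = pd.DataFrame({'userid': [1, 1, 1, 1],
                   'time': pd.to_datetime(['2010-06-01 00:15', '2010-06-01 00:30',
                                           '2010-06-01 00:45', '2010-06-01 01:00']),
                   'val1': [12, 11.5, 12, 8]})
# Resample each user's 15-minute readings into 30-minute means.
out = (df.set_index('time')
         .groupby('userid')
         .resample('30Min', closed='right', label='right')
         .mean())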
Method 2
df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
Out[308]:
time val1 val2 val3 val4
0 2010-06-01 00:30:00 11.75 15.0 16.1 10.5
1 2010-06-01 01:00:00 10.00 12.5 14.0 5.0
Updated solution (also grouping by userid):
df.groupby([df.userid, np.arange(len(df))//2]).agg(lambda x: x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean()).reset_index(drop=True)
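A variant of the same idea that does not depend on global row positions (the cumcount pairing is my own suggestion, not part of the answer): number the rows within each user and pair them per user.
import numpy as np
import pandas as pd
# Hypothetical two-user frame with 15-minute readings.
df = pd.DataFrame({'userid': [1, 1, 2, 2],
                   'time': pd.to_datetime(['2010-06-01 00:15', '2010-06-01 00:30',
                                           '2010-06-01 00:15', '2010-06-01 00:30']),
                   'val1': [12.0, 11.5, 14.0, 11.0]})
# Pair consecutive rows within each user, regardless of how users are interleaved.
pair = df.groupby('userid').cumcount() // 2
out = (df.groupby(['userid', pair])
         .agg({'time': 'last', 'val1': 'mean'})
         .reset_index(level=1, drop=True)
         .reset_index())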
This solution stays in pandas, and is far more performant than the groupby-agg solution:
>>> df = pd.DataFrame({"a":range(10),
"b":range(0, 20, 2),
"c":pd.date_range('2018-01-01', periods=10, freq='H')})
>>> df
a b c
0 0 0 2018-01-01 00:00:00
1 1 2 2018-01-01 01:00:00
2 2 4 2018-01-01 02:00:00
3 3 6 2018-01-01 03:00:00
4 4 8 2018-01-01 04:00:00
5 5 10 2018-01-01 05:00:00
6 6 12 2018-01-01 06:00:00
7 7 14 2018-01-01 07:00:00
8 8 16 2018-01-01 08:00:00
9 9 18 2018-01-01 09:00:00
>>> pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2,
df.iloc[::2, 2]], axis=1)
a b c
0 0.5 1.0 2018-01-01 00:00:00
2 2.5 5.0 2018-01-01 02:00:00
4 4.5 9.0 2018-01-01 04:00:00
6 6.5 13.0 2018-01-01 06:00:00
8 8.5 17.0 2018-01-01 08:00:00
Performance:
In [41]: n = 100000
In [42]: df = pd.DataFrame({"a":range(n), "b":range(0, n*2, 2), "c":pd.date_range('2018-01-01', periods= n, freq='S')})
In [44]: df.shape
Out[44]: (100000, 3)
In [45]: %timeit pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2, df.iloc[::2, 2]], axis=1)
2.21 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [46]: %timeit df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
7.9 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
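One caveat worth noting: the slice-and-add trick assumes an even number of rows; otherwise the two slices differ in length and the addition raises. A small sketch of a guard (the helper name and the n_numeric parameter are mine, not from the answer):
import pandas as pd
def pairwise_mean(df, n_numeric):
    # Average consecutive row pairs over the first n_numeric columns and
    # keep the remaining columns from the first row of each pair.
    if len(df) % 2:           # drop a trailing unpaired row, if any
        df = df.iloc[:-1]
    averaged = (df.iloc[::2, :n_numeric] + df.iloc[1::2, :n_numeric].values) / 2
    return pd.concat([averaged, df.iloc[::2, n_numeric:]], axis=1)
Note also that this approach keeps the timestamp of the first row of each pair, while the question's expected output labels each pair with the second row's time.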
I tried both answers and both worked, but Noah's was the fastest in my experience, so I marked it as the solution.
Here is my version of Noah's answer, with some explanation and edits to fit my dataset.
To use Noah's answer, the time column should be first or last (I may be wrong). Therefore, I moved the time column to the end:
# swap the 'time' column (index 1) with the last column (index 10)
col = data.columns.tolist()
col[1], col[10] = col[10], col[1]
data2 = data[col]
Then I did the concatenation. Here ::2 selects every second row and :10 selects columns 0 through 9. Then I append the time column, which is at index 10.
x = pd.concat([(data2.iloc[::2, :10] + data2.iloc[1::2, :10].values) / 2, data2.iloc[::2, 10]], axis=1)
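If the time column is not always at position 10, a more general reordering works too (continuing from the data frame read above; this is my own variation, not part of Noah's answer, and it assumes the column is literally named 'time'):
cols = [c for c in data.columns if c != 'time'] + ['time']   # move 'time' last
data2 = data[cols]
n = len(cols) - 1                                            # number of value columns
x = pd.concat([(data2.iloc[::2, :n] + data2.iloc[1::2, :n].values) / 2,
               data2.iloc[::2, n]], axis=1)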
Related
I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function because the window is not a fixed size in terms of dataframe indices. Here is what I'm trying to do:
I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS. I want to add a third column that subtracts, from each WINDS value, the mean of all WINDS values within 12 hours of that row's UT time. Currently, I do it like this:
rolsub = []
for i in df['UT']:
    df1 = df[(df['UT'] > (i - 12)) & (df['UT'] < (i + 12))]
    df2 = df[df['UT'] == i]
    rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
df['WIND_SUB'] = rolsub
This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!
If I understood correctly, you could create a fake DatetimeIndex to use for rolling.
Example data:
import pandas as pd
df = pd.DataFrame({'UT':[0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
'WINDS':[1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
print(df)
UT WINDS
0 0.5 1
1 1.0 1
2 2.0 10
3 8.0 1
4 9.0 1
5 12.0 1
6 13.0 5
7 14.0 5
8 15.0 5
9 24.0 5
10 60.0 5
11 61.0 1
12 63.0 1
13 100.0 10
Code:
# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')
df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
print(df)
Which gives:
UT WINDS WINDS_SUB
dt
2022-05-11 00:30:00 0.5 1 -1.500000
2022-05-11 01:00:00 1.0 1 -1.500000
2022-05-11 02:00:00 2.0 10 7.142857
2022-05-11 08:00:00 8.0 1 -2.333333
2022-05-11 09:00:00 9.0 1 -2.333333
2022-05-11 12:00:00 12.0 1 -2.333333
2022-05-11 13:00:00 13.0 5 0.875000
2022-05-11 14:00:00 14.0 5 1.714286
2022-05-11 15:00:00 15.0 5 1.714286
2022-05-12 00:00:00 24.0 5 0.000000
2022-05-13 12:00:00 60.0 5 2.666667
2022-05-13 13:00:00 61.0 1 -1.333333
2022-05-13 15:00:00 63.0 1 -1.333333
2022-05-15 04:00:00 100.0 10 0.000000
The result on this small test set matches the output of your code. This assumes UT represents hours from a certain start point, which seems to be the case judging by your solution.
Runtime:
I tested it on the following df with 30,000 rows:
import numpy as np
df = pd.DataFrame({'UT':range(30000),
'WINDS':np.full(30000, 1)})
def loop(df):
    rolsub = []
    for i in df['UT']:
        df1 = df[(df['UT'] > (i - 12)) & (df['UT'] < (i + 12))]
        df2 = df[df['UT'] == i]
        rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
    df['WIND_SUB'] = rolsub

def vector(df):
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')
    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
    return df
# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)
# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)
So it's roughly 6,000 times faster on this test.
I have 7 dataframes (df_1, df_2, df_3, ..., df_7), all with the same columns but different lengths; they sometimes contain the same values.
I'd like to concatenate all 7 dataframes under the conditions that:
if df_n.iloc[row_i] != df_n+1.iloc[row_i] and df_n.iloc[row_i][0] < df_n+1.iloc[row_i][0]:
    pd.concat([df_n.iloc[row_i], df_n+1.iloc[row_i], df_n+2.iloc[row_i], ...., df_n+6.iloc[row_i]])
Where df_n.iloc[row_i] is the ith row of the nth dataframe and df_n.iloc[row_i][0] is the first column of the ith row.
For example, if we only had 2 dataframes with len(df_1) < len(df_2) and we used the conditions above, the input would be:
df_1 df_2
index 0 1 2 index 0 1 2
0 12.12 11.0 31 0 12.2 12.6 30
1 12.3 12.1 33 1 12.3 12.1 33
2 10 9.1 33 2 13 12.1 23
3 16 12.1 33 3 13.1 12.1 27
4 14.4 13.1 27
5 15.2 13.2 28
And the output would be:
conditions -> pd.concat([df_1, df_2]):
index 0 1 2 3 4 5
0 12.12 11.0 31 12.2 12.6 30
2 10 9.1 33 13 12.1 23
4 nan nan nan 14.4 13.1 27
5 nan nan nan 15.2 13.2 28
Is there an easy way to do this?
IIUC, concat first, then groupby over the columns to get the differences, and then implement your condition:
s = pd.concat([df1, df2], axis=1)
s1 = s.groupby(level=0, axis=1).apply(lambda x: x.iloc[:, 0] - x.iloc[:, 1])
yourdf = s[(s1.ne(0).any(axis=1) & s1.iloc[:, 0].lt(0)) | s1.iloc[:, 0].isnull()]
Out[487]:
0 1 2 0 1 2
index
0 12.12 11.0 31.0 12.2 12.6 30
2 10.00 9.1 33.0 13.0 12.1 23
4 NaN NaN NaN 14.4 13.1 27
5 NaN NaN NaN 15.2 13.2 28
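For anyone who wants to reproduce this, the two inputs can be typed in from the question's tables like this (a sketch; the values are copied from the example):
import pandas as pd
df1 = pd.DataFrame({0: [12.12, 12.3, 10, 16],
                    1: [11.0, 12.1, 9.1, 12.1],
                    2: [31, 33, 33, 33]})
df2 = pd.DataFrame({0: [12.2, 12.3, 13, 13.1, 14.4, 15.2],
                    1: [12.6, 12.1, 12.1, 12.1, 13.1, 13.2],
                    2: [30, 33, 23, 27, 27, 28]})
df1.index.name = df2.index.name = 'index'
# Running the three lines above on these frames reproduces the shown output.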
I have the following NFL tracking data:
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
4 NaN 1 6 22.5 17.2
4 NaN 1 7 22.5 17.2
4 NaN 1 8 22.5 17.2
4 NaN 1 9 22.5 17.2
4 NaN 1 10 22.5 17.2
5 NaN 2 1 23.0 16.9
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
8 NaN 2 4 25.9 34.2
10 NaN 3 1 22.7 34.2
11 Nan 3 2 21.5 34.5
12 NaN 3 3 21.1 37.3
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
How can I filter this DataFrame to get only the rows between the "Start" and "End" values of the Event column? To clarify, this is the data I want to filter for:
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
An explicit solution will not work because the actual dataset is very large and there is no way to predict where the Start and End values fall.
Do this with slicing and ffill, then concat back. Also, you have Nan in your df; should it be NaN?
df1=df.copy()
newdf=pd.concat([df1[df.Event.ffill()=='Start'],df1[df.Event=='End']]).sort_index()
newdf
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
Or
newdf=df[~((df.Event.ffill()=='End')&(df.Event.isna()))]
newdf
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
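The same idea can also be written as a single boolean mask, which avoids the concat and sort step (a sketch using the same df as above; it mirrors the first variant by keeping rows whose most recent marker is 'Start', plus the 'End' rows themselves):
# Forward-fill the markers: a row "belongs" to the last Start/End seen above it.
marker = df.Event.ffill()
newdf = df[(marker == 'Start') | (df.Event == 'End')]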
I have been searching for a way to create dummies using the date index in pandas, but I could not find anything yet.
I have a df that is indexed by date:
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
...
I know I could set date as a column using
df.reset_index(level=0, inplace=True)
and then use something like this to create dummies,
df['main_hours'] = np.where((df['date'] >= '2010-01-02 03:00:00') & (df['date'] <= '2010-01-02 05:00:00'), 1, 0)
However, I would like to create the dummy variables from the date index on the fly, without turning date into a column. Is there a way to do that in pandas?
Any suggestion would be appreciated.
IIUC:
df['main_hours'] = \
np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'),
1,
0)
or:
In [8]: df['main_hours'] = \
((df.index >= '2010-01-02 03:00:00') &
(df.index <= '2010-01-02 05:00:00')).astype(int)
In [9]: df
Out[9]:
dew temp main_hours
date
2010-01-02 00:00:00 129.0 -16 0
2010-01-02 01:00:00 148.0 -15 0
2010-01-02 02:00:00 159.0 -11 0
2010-01-02 03:00:00 181.0 -7 1
2010-01-02 04:00:00 138.0 -7 1
Timing: for a 50,000-row DF:
In [19]: df = pd.concat([df.reset_index()] * 10**4, ignore_index=True).set_index('date')
In [20]: pd.options.display.max_rows = 10
In [21]: df
Out[21]:
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
... ... ...
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
[50000 rows x 2 columns]
In [22]: %timeit ((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00')).astype(int)
1.58 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: %timeit np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'), 1, 0)
1.52 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: df.shape
Out[24]: (50000, 2)
Or using between:
pd.Series(df.index).between('2010-01-02 03:00:00', '2010-01-02 05:00:00', inclusive=True).astype(int)
Out[1567]:
0 0
1 0
2 0
3 1
4 1
Name: date, dtype: int32
df = df.assign(main_hours=0)
df.loc[df.between_time(start_time='3:00', end_time='5:00').index, 'main_hours'] = 1
>>> df
dew temp main_hours
2010-01-02 00:00:00 129 -16 0
2010-01-02 01:00:00 148 -15 0
2010-01-02 02:00:00 159 -11 0
2010-01-02 03:00:00 181 -7 1
2010-01-02 04:00:00 138 -7 1
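One thing to keep in mind with between_time: it selects by time of day, so on an index spanning several days it flags 03:00-05:00 on every date, whereas the index comparisons above use an absolute window. A small sketch illustrating the difference (the two-day hourly index is made up):
import pandas as pd
idx = pd.date_range('2010-01-02', periods=48, freq='H')           # two days, hourly
df = pd.DataFrame({'dew': range(48)}, index=idx)
# Absolute window: only 2010-01-02 03:00-05:00 is flagged.
absolute = ((df.index >= '2010-01-02 03:00:00') &
            (df.index <= '2010-01-02 05:00:00')).astype(int)
# Time-of-day window: 03:00-05:00 is flagged on both days.
by_time_of_day = df.index.isin(df.between_time('3:00', '5:00').index).astype(int)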
I have the following pandas dataframe:
import numpy as np
import pandas as pd
dfw = pd.DataFrame({"id": ["A", "B"],
"start_date": pd.to_datetime(["2012-01-01", "2013-02-13"], format="%Y-%m-%d"),
"end_date": pd.to_datetime(["2012-04-17", "2014-11-18"], format="%Y-%m-%d")})
Result:
end_date id start_date
2012-04-17 A 2012-01-01
2014-11-18 B 2013-02-13
I am looking for the most efficient way to transform this dataframe to the following dataframe:
dates = np.empty(0, dtype="datetime64[M]")
dates = np.append(dates, pd.date_range(start="2012-01-01", end="2012-06-01", freq="MS").astype("object"))
dates = np.append(dates, pd.date_range(start="2013-02-01", end="2014-12-01", freq="MS").astype("object"))
dfl = pd.DataFrame({"id": np.repeat(["A", "B"], [6, 23]),
"counter": np.concatenate((np.arange(0, 6, dtype="float"), np.arange(0, 23, dtype="float"))),
"date": pd.to_datetime(dates, format="%Y-%m-%d")})
Result:
counter date id
0.0 2012-01-01 A
1.0 2012-02-01 A
2.0 2012-03-01 A
3.0 2012-04-01 A
4.0 2012-05-01 A
5.0 2012-06-01 A
0.0 2013-02-01 B
1.0 2013-03-01 B
2.0 2013-04-01 B
3.0 2013-05-01 B
4.0 2013-06-01 B
5.0 2013-07-01 B
6.0 2013-08-01 B
7.0 2013-09-01 B
8.0 2013-10-01 B
9.0 2013-11-01 B
10.0 2013-12-01 B
11.0 2014-01-01 B
12.0 2014-02-01 B
13.0 2014-03-01 B
14.0 2014-04-01 B
15.0 2014-05-01 B
16.0 2014-06-01 B
17.0 2014-07-01 B
18.0 2014-08-01 B
19.0 2014-09-01 B
20.0 2014-10-01 B
21.0 2014-11-01 B
22.0 2014-12-01 B
A naive solution I have come up with so far is the following function:
def expand(df):
    dates = np.empty(0, dtype="datetime64[ns]")
    ids = np.empty(0, dtype="object")
    counter = np.empty(0, dtype="float")
    for name, group in df.groupby("id"):
        start_date = group["start_date"].min()
        start_date = pd.to_datetime(np.array(start_date, dtype="datetime64[M]").tolist())
        end_date = group["end_date"].min()
        end_date = end_date + pd.Timedelta(1, unit="M")
        end_date = pd.to_datetime(np.array(end_date, dtype="datetime64[M]").tolist())
        tmp = pd.date_range(start=start_date, end=end_date, freq="MS", closed=None).values
        dates = np.append(dates, tmp)
        ids = np.append(ids, np.repeat(group.id.values[0], len(tmp)))
        counter = np.append(counter, np.arange(0, len(tmp)))
    dfl = pd.DataFrame({"id": ids, "counter": counter, "date": dates})
    return dfl
But it is not very fast:
%timeit expand(dfw)
100 loops, best of 3: 4.84 ms per loop
Normally I advise avoiding itertuples, but in some situations it can be more intuitive. You can get fine-grained control of the endpoints via kwargs to pd.date_range if desired (e.g. to include an endpoint or not).
In [27]: result = pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()
In [28]: result.columns = ['date', 'id']
In [29]: result
Out[29]:
date id
0 2012-01-01 A
1 2012-01-02 A
2 2012-01-03 A
3 2012-01-04 A
4 2012-01-05 A
5 2012-01-06 A
.. ... ...
746 2014-11-13 B
747 2014-11-14 B
748 2014-11-15 B
749 2014-11-16 B
750 2014-11-17 B
751 2014-11-18 B
[752 rows x 2 columns]
In [26]: %timeit pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()
100 loops, best of 3: 2.15 ms per loop
Not really sure of the purpose of making this super fast. You would generally do this kind of expansion a single time.
You wanted month starts, so here is that.
In [23]: result = pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date+pd.offsets.MonthBegin(1), freq='MS', closed=None)) for r in dfw.itertuples()]).reset_index()
In [24]: result.columns = ['date', 'id']
In [25]: result
Out[25]:
date id
0 2012-01-01 A
1 2012-02-01 A
2 2012-03-01 A
3 2012-04-01 A
4 2012-05-01 A
5 2013-03-01 B
6 2013-04-01 B
7 2013-05-01 B
8 2013-06-01 B
9 2013-07-01 B
10 2013-08-01 B
11 2013-09-01 B
12 2013-10-01 B
13 2013-11-01 B
14 2013-12-01 B
15 2014-01-01 B
16 2014-02-01 B
17 2014-03-01 B
18 2014-04-01 B
19 2014-05-01 B
20 2014-06-01 B
21 2014-07-01 B
22 2014-08-01 B
23 2014-09-01 B
24 2014-10-01 B
25 2014-11-01 B
26 2014-12-01 B
You can adjust dates like this
In [17]: pd.Timestamp('2014-01-17')-pd.offsets.MonthBegin(1)
Out[17]: Timestamp('2014-01-01 00:00:00')
In [18]: pd.Timestamp('2014-01-31')-pd.offsets.MonthBegin(1)
Out[18]: Timestamp('2014-01-01 00:00:00')
In [19]: pd.Timestamp('2014-02-01')-pd.offsets.MonthBegin(1)
Out[19]: Timestamp('2014-01-01 00:00:00')
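The desired output in the question also carries a per-id counter, which the answer above does not produce. Assuming the month-start result frame with columns named date and id as above, a small sketch that adds it with cumcount:
# Number the rows within each id, starting at 0, to reproduce the 'counter' column.
result['counter'] = result.groupby('id').cumcount()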