I want to resample a pandas datetime series using the resample method, but I don't understand the output I'm getting. I was expecting a '5s' sampling, yet I get 17460145 rows from an original DataFrame of 100 rows. What is the correct way to use resample?
import numpy as np
import pandas as pd
def random_dates(start, end, n=100):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2022-01-01')
end = pd.to_datetime('2023-01-01')
rd=random_dates(start, end)
clas = np.random.choice(['A','B','C'],size=100)
value = np.random.randint(0,100,size=100)
df =pd.DataFrame.from_dict({'ts':rd,'cl':clas,'vl':value}).set_index('ts').sort_index()
df
Out[48]:
cl vl
ts
2022-01-04 17:25:10 B 27
2022-01-06 19:17:35 C 34
2022-01-17 22:55:25 B 1
2022-01-23 00:33:25 A 20
2022-01-27 18:26:56 A 55
.. ..
2022-12-14 07:46:50 C 22
2022-12-18 02:33:52 C 52
2022-12-22 17:35:10 A 52
2022-12-28 04:55:20 A 57
2022-12-29 03:19:00 A 60
[100 rows x 2 columns]
df.groupby(by='cl').resample('5s').mean()
Out[49]:
vl
cl ts
A 2022-01-23 00:33:25 20.0
2022-01-23 00:33:30 NaN
2022-01-23 00:33:35 NaN
2022-01-23 00:33:40 NaN
2022-01-23 00:33:45 NaN
...
C 2022-12-18 02:33:30 NaN
2022-12-18 02:33:35 NaN
2022-12-18 02:33:40 NaN
2022-12-18 02:33:45 NaN
2022-12-18 02:33:50 52.0
[17460145 rows x 1 columns]
Use pd.Grouper:
>>> df.groupby(['cl', pd.Grouper(freq='5s')]).mean()
vl
cl ts
A 2022-01-22 11:53:30 31.0
2022-02-01 21:24:55 60.0
2022-03-20 06:01:05 24.0
2022-04-03 00:04:05 55.0
2022-04-03 06:30:10 81.0
... ...
C 2022-11-23 23:17:20 92.0
2022-11-25 07:07:45 27.0
2022-12-07 00:18:05 88.0
2022-12-25 10:37:25 77.0
2022-12-28 14:29:25 33.0
[100 rows x 1 columns]
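The reason for the blow-up, as I understand it: df.groupby('cl').resample('5s') materialises a row for every 5-second bin between each group's first and last timestamp, which over a year of data means millions of mostly-NaN bins, while pd.Grouper only keeps the bins that actually contain observations. If you do want the resample route, a sketch that drops the empty bins should give the same 100 rows (assuming the same df as above):

out = (df.groupby('cl')
         .resample('5s')['vl']
         .mean()
         .dropna())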
Related
I am working on a requirement where I need to increment a date column in weeks. Here is the code I have so far:
import pandas as pd
import numpy as np
import datetime

c = 15
s = {'week': [1, 2, 3, 4, 5, 6, 7, 8], 'Sales': [10, 20, 30, 40, 50, 60, 70, 80]}
p = pd.DataFrame(data=s)
p['week'] = p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
O/P-
How can I continue from the last row of the week column to get the next 15 weeks? Basically, the desired output should extend the week column from 2021-03-01 onwards for the following weeks.
One option is to use date_range to generate additional dates, then use set_index + reindex to append them:
p = (p.set_index('week')
      .reindex(pd.date_range('2021-01-04', periods=8+14, freq='W-MON'))
      .rename_axis(['week'])
      .reset_index())
Output:
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN
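If you'd rather not hard-code the start date or the total length, a sketch (assuming the frame p built above and your variable c as the number of extra weeks) could derive them from the existing data:

new_index = pd.date_range(p['week'].iloc[0], periods=len(p) + c, freq='W-MON')
p = (p.set_index('week')
      .reindex(new_index)
      .rename_axis(['week'])
      .reset_index())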
You can control the length of the week list with the range() function and your variable c, but you then also have to check the length of sales, which must end up with the same number of elements:
import pandas as pd
import numpy as np
import datetime
c = 15
weeks = list(range(1, c + 1))
sales = [10, 20, 30, 40, 50, 60, 70, 80]
# Pad sales with None so both lists have the same length.
s = {'week': weeks,
     'Sales': sales + [None] * max(len(weeks) - len(sales), 0)}
p = pd.DataFrame(data=s)
p['week'] = p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
print(p)
Another option is DateOffset:
p = pd.concat(
    [p, pd.DataFrame({'week': [p.iloc[-1, 0] + pd.DateOffset(weeks=i) for i in range(1, c)]})],
    ignore_index=True
)
>>> p
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN
I tried to merge two dataframes using pandas, but this is the error that I get:
ValueError: You are trying to merge on datetime64[ns] and datetime64[ns, UTC] columns. If you wish to proceed you should use pd.concat
I have tried different solutions found online, but nothing works. The code was provided to me and it seems to work on other PCs, but not on my computer.
This is my code:
import sys
import os
from datetime import datetime
import numpy as np
import pandas as pd
# --------------------------------------------------------------------
# -- price, consumption and production --
# --------------------------------------------------------------------
fn = '../data/np_data.csv'
if os.path.isfile(fn):
    df_data = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- temp. data --
# --------------------------------------------------------------------
fn = '../data/temp.csv'
if os.path.isfile(fn):
    dtemp = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- price data --
# -- first date: 2014-01-13 --
# -- last date: 2020-02-01 --
# --------------------------------------------------------------------
fn = '../data/eprice.csv'
if os.path.isfile(fn):
    eprice = pd.read_csv(fn, header=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- combine dataframes (and save as CSV file) --
# --------------------------------------------------------------------
#
df = df_data.merge(dtemp, on='time', how='left')  ## This is where I get the error.
print(df.info())
print(eprice.info())
#
# add eprice
df = df.merge(eprice, on='date', how='left')
#
# eprice is only available on trading days
# fill in missing values; the last observation is carried forward
df = df.fillna(method='ffill')
#
# keep only the relevant time period
df = df[df.date > '2014-01-23']
df = df[df.date < '2020-02-01']
df.to_csv('../data/my_data.csv',index=False)
The datasets that have been imported look normal, with the expected number of columns and observations. The pandas version I have is 1.0.3.
Edit:
This is the output (df) when I first merge df_data and dtemp.
time price_sys price_no1 ... temp_no3 temp_no4 temp_no5
0 2014-01-23 00:00:00+00:00 32.08 32.08 ... NaN NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 31.60 ... -2.5 -8.7 2.5
2 2014-01-24 00:00:00+00:00 30.96 31.02 ... -2.5 -8.7 2.5
3 2014-01-24 00:00:00+00:00 30.84 30.79 ... -2.5 -8.7 2.5
4 2014-01-24 00:00:00+00:00 31.58 31.10 ... -2.5 -8.7 2.5
[5 rows x 25 columns]
This is the output for eprice before I merge:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
This is what happens when I merge df and eprice:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
time price_sys ... coal price carbon price
0 2014-01-23 00:00:00+00:00 32.08 ... NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 ... NaN NaN
2 2014-01-24 00:00:00+00:00 30.96 ... NaN NaN
3 2014-01-24 00:00:00+00:00 30.84 ... NaN NaN
4 2014-01-24 00:00:00+00:00 31.58 ... NaN NaN
[5 rows x 29 columns]
Try doing df['time'] = pd.to_datetime(df['time'], utc=True) on both of the time columns before joining (or rather, the one without UTC needs to go through this!).
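A minimal sketch of that fix, assuming the tz-naive side is dtemp['time'] (the error is raised on the first merge); to_datetime with utc=True leaves an already-UTC column as UTC, so applying it to both sides is safe:

# Make both merge keys datetime64[ns, UTC] before merging.
df_data['time'] = pd.to_datetime(df_data['time'], utc=True)
dtemp['time'] = pd.to_datetime(dtemp['time'], utc=True)

df = df_data.merge(dtemp, on='time', how='left')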
I am trying to subtract two datetimes: when there is a valid value for both T1 and T2, I want to obtain the difference. The difference is calculated by considering only the weekdays between the dates, not counting Saturday and Sunday.
The code works for only some rows. How can this be fixed?
T1 T2 Diff
0 2017-12-04 05:48:15 2018-01-05 12:15:22 NaN
1 2017-07-10 08:23:11 2018-01-05 15:28:22 NaN
2 2017-09-11 05:10:37 2018-01-29 15:02:07 NaN
3 2017-12-21 04:51:12 2018-01-29 16:06:43 NaN
4 2017-10-13 10:11:00 2018-02-22 16:19:04 NaN
5 2017-09-28 21:44:31 2018-01-29 12:42:02 NaN
6 2018-01-23 20:00:58 2018-01-29 14:40:33 NaN
7 2017-11-28 15:39:38 2018-01-31 11:57:04 NaN
8 2017-12-21 12:44:00 2018-01-31 13:12:37 30.0
9 2017-11-09 05:52:29 2018-01-22 11:42:01 53.0
10 2018-02-12 04:21:08 NaT NaN
df[['T1','T2','diff']].dtypes
T1 datetime64[ns]
T2 datetime64[ns]
diff float64
df['T1'] = pd.to_datetime(df['T1'])
df['T2'] = pd.to_datetime(df['T2'])

def fun(row):
    if row.isnull().any():
        return np.nan
    ts = pd.DataFrame(pd.date_range(row["T1"], row["T2"]), columns=["date"])
    ts["dow"] = ts["date"].dt.weekday
    return (ts["dow"] < 5).sum()

df["diff"] = df.apply(lambda x: fun(x), axis=1)
Instead of trying to check for a null value in the row, use a try/except to catch the error when the calculation runs into a null value.
This worked for me in, I think, the manner you want.
import pandas as pd
import numpy as np

df = pd.read_csv("/home/rightmire/Downloads/test.csv", sep=",")
# df = df[["m1","m2"]]
print(df)

# print(df[['m1','m2']].dtypes)
df['m1'] = pd.to_datetime(df['m1'])
df['m2'] = pd.to_datetime(df['m2'])
print(df[['m1','m2']].dtypes)

# for index, row in df.iterrows():
def fun(row):
    try:
        ts = pd.DataFrame(pd.date_range(row["m1"], row["m2"]), columns=["date"])
        # print(ts)
        ts["dow"] = ts["date"].dt.weekday
        result = (ts["dow"] < 5).sum()
        # print("Result = ", result)
        return result
    except Exception as e:
        # print("ERROR:{}".format(str(e)))
        result = np.nan
        # print("Result = ", result)
        return result

df["diff"] = df.apply(lambda x: fun(x), axis=1)
print(df["diff"])
OUTPUT OF INTEREST:
dtype: object
0 275.0
1 147.0
2 58.0
3 28.0
4 95.0
5 87.0
6 4.0
7 46.0
8 30.0
9 96.0
10 NaN
11 27.0
12 170.0
13 158.0
14 79.0
Name: diff, dtype: float64
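A possible alternative sketch (using the T1/T2 column names from the question): numpy's busday_count counts weekdays directly and avoids the per-row date_range, though it treats the end date as exclusive, so it can differ by one from the approach above:

# Count weekdays between T1 and T2 only where both values are present.
mask = df['T1'].notna() & df['T2'].notna()
df.loc[mask, 'diff'] = np.busday_count(
    df.loc[mask, 'T1'].values.astype('datetime64[D]'),
    df.loc[mask, 'T2'].values.astype('datetime64[D]'),
)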
I have several .csv files which I am importing via pandas and then working out a summary of the data (min, max, mean), ideally as weekly and monthly reports. I have the following code, but I just do not seem to get the monthly summary to work; I am sure the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
# Format of the data being imported
# 2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print('month info')
print([g for n, g in df.groupby(pd.Grouper(key='timestamp', freq='M'))])
print(df.groupby('timestamp')['light'].mean())
IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary for column 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
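Equivalently, since the grouping key is a datetime column, setting it as the index and using resample should give the same summaries (a sketch, assuming the same df; the raw 'time' string column is dropped first so only numeric columns are aggregated):

monthly_summary = (df.drop(columns='time')
                     .set_index('timestamp')
                     .resample('M')
                     .agg(['mean', 'min', 'max']))
weekly_summary = (df.drop(columns='time')
                    .set_index('timestamp')
                    .resample('W')
                    .agg(['mean', 'min', 'max']))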