I would like to delete the rows from dataframe df1, if the current date is between the ShiftScheduledStart and ShiftScheduledEnd values. My idea was the code below, however this does not give the right result.
df1[(df1['ShiftScheduledEnd'] < CurrentDateVar) & (CurrentDateVar < df1['ShiftScheduledStart'])]
What is wrong?
Thanks!
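The comparison in your code is reversed: it tests ShiftScheduledEnd < CurrentDateVar < ShiftScheduledStart, which can never be true for a shift that starts before it ends. Even with the comparison the right way round, none of your sample rows satisfy the condition: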
In [7]:
import io
import pandas as pd
t="""ShiftScheduledEnd,ShiftScheduledStart
16-5-2015 14:30,16-5-2015 6:00
13-7-2015 22:00,13-7-2015 14:00
13-7-2015 22:30,13-7-2015 14:00
13-7-2015 22:00,13-7-2015 14:00"""
# dayfirst=True makes the day-month-year order explicit
df1 = pd.read_csv(io.StringIO(t), parse_dates=[0,1], dayfirst=True)
print(df1)
CurrentDateVar = pd.to_datetime('14-7-2015 23:45')
CurrentDateVar
ShiftScheduledEnd ShiftScheduledStart
0 2015-05-16 14:30:00 2015-05-16 06:00:00
1 2015-07-13 22:00:00 2015-07-13 14:00:00
2 2015-07-13 22:30:00 2015-07-13 14:00:00
3 2015-07-13 22:00:00 2015-07-13 14:00:00
Out[7]:
Timestamp('2015-07-14 23:45:00')
In [8]:
df1[(df1['ShiftScheduledStart'] < CurrentDateVar) & (df1['ShiftScheduledEnd'] > CurrentDateVar)]
Out[8]:
Empty DataFrame
Columns: [ShiftScheduledEnd, ShiftScheduledStart]
Index: []
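To actually drop the rows whose shift contains the current time, one option is to build the "inside the shift" mask and negate it with ~. This is a minimal sketch that reuses the sample data above; the helper name drop_active_shifts and the second timestamp (inside_shift) are made up for illustration, the latter chosen so that one row matches:

```python
import io
import pandas as pd

t = """ShiftScheduledEnd,ShiftScheduledStart
16-5-2015 14:30,16-5-2015 6:00
13-7-2015 22:00,13-7-2015 14:00
13-7-2015 22:30,13-7-2015 14:00
13-7-2015 22:00,13-7-2015 14:00"""
df1 = pd.read_csv(io.StringIO(t))
for col in df1.columns:
    # Explicit day-month-year format, so nothing is left to inference.
    df1[col] = pd.to_datetime(df1[col], format="%d-%m-%Y %H:%M")


def drop_active_shifts(df, now):
    """Return df without the rows where start <= now <= end (hypothetical helper)."""
    active = (df["ShiftScheduledStart"] <= now) & (now <= df["ShiftScheduledEnd"])
    return df[~active]


now = pd.Timestamp("2015-07-14 23:45")
kept_all = drop_active_shifts(df1, now)            # no shift is active, nothing is dropped

inside_shift = pd.Timestamp("2015-05-16 10:00")    # assumed time inside row 0's shift
kept_some = drop_active_shifts(df1, inside_shift)  # row 0 is dropped
```

Whether the boundaries should be inclusive (<=) or strict (<) depends on how you want to treat a shift that starts or ends exactly at the current time.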
I would like to subtract datetimes in pandas, taking the rows two at a time (row 2 minus row 1, row 4 minus row 3, and so on), but I don't know which function to use.
Timestamp
2020-11-26 20:00:00
2020-11-26 21:00:00
2020-11-26 22:00:00
2020-11-26 23:30:00
Explanation:
(2020-11-26 21:00:00) - (2020-11-26 20:00:00)
(2020-11-26 23:30:00) - (2020-11-26 22:00:00)
The result must be:
01:00:00
01:30:00
First, check that the column has a datetime dtype.
If it does not, convert it with pd.to_datetime():
demo = pd.DataFrame(columns=['Timestamps'])
demotime = ['20:00:00','21:00:00','22:00:00','23:30:00']
demo['Timestamps'] = demotime
demo['Timestamps'] = pd.to_datetime(demo['Timestamps'])
Your dataframe would look like this (pd.to_datetime fills in the current date because the strings contain only times):
Timestamps
0 2020-11-29 20:00:00
1 2020-11-29 21:00:00
2 2020-11-29 22:00:00
3 2020-11-29 23:30:00
After that, loop over the rows two at a time (with a for or while loop, stepping i by 2) and compute:
demo.iloc[i+1,0]-demo.iloc[i,0]
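Written out, the loop could look like the sketch below. It assumes the four timestamps from the question and steps i by 2, so that each pair of rows is used exactly once; the list name diffs is only illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"Timestamps": pd.to_datetime([
    "2020-11-26 20:00:00", "2020-11-26 21:00:00",
    "2020-11-26 22:00:00", "2020-11-26 23:30:00",
])})

diffs = []
# i takes the values 0, 2, ...; a trailing unpaired row (odd length) is skipped.
for i in range(0, len(demo) - 1, 2):
    diffs.append(demo.iloc[i + 1, 0] - demo.iloc[i, 0])

for d in diffs:
    print(d)
```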
IIUC, you want to iterate over chunks of two rows and take the difference within each chunk. One approach (with numpy imported as np) is:
res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)
Output
Timestamp
1 0 days 01:00:00
3 0 days 01:30:00
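For reference, here is a self-contained version of the same idea, assuming the four timestamps from the question. The intermediate name pair_id is only there to make the grouping key visible: integer division by 2 labels rows 0 and 1 as group 0, rows 2 and 3 as group 1, and diff() then runs inside each group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2020-11-26 20:00:00", "2020-11-26 21:00:00",
    "2020-11-26 22:00:00", "2020-11-26 23:30:00",
])})

pair_id = np.arange(len(df)) // 2      # array([0, 0, 1, 1])
# The first row of every pair has no predecessor in its group, so diff()
# gives NaT there; dropna() removes those rows.
res = df.groupby(pair_id).diff().dropna()
print(res)
```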
I have a dataframe with 3 columns:
import glob
import pandas as pd
file = glob.glob('InputFile.csv')
for i in file:
    df = pd.read_csv(i)
    df['Date'] = pd.to_datetime(df['Date'])
    print(df)
Date X Y
0 2020-02-13 00:11:59 -91.3900 -31.7914
1 2020-02-13 01:11:59 -87.1513 -34.6838
2 2020-02-13 02:11:59 -82.9126 -37.5762
3 2020-02-13 03:11:59 -79.3558 -40.2573
4 2020-02-13 04:11:59 -73.2293 -44.2463
... ... ... ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000
[2039 rows x 3 columns]
print(converted_file.dtypes)
Date datetime64[ns]
xTilt float64
yTilt float64
dtype: object
I would like the output to be:
Date X Y X_Diff Y_Diff
0 2020-02-16 00:11:59 -38.46270 -70.8352 -38.46270 -70.8352
1 2020-02-23 00:11:59 -80.70250 -7.1893 -42.23980 63.6459
2 2020-03-01 00:11:59 -47.38980 -39.2652 33.31270 -32.0759
3 2020-03-08 00:00:00 -35.65350 -64.5058 11.73630 -25.2406
4 2020-03-15 00:00:00 -43.03290 -15.8425 -7.37940 48.6633
5 2020-03-22 00:00:00 -19.77130 -25.5298 23.26160 -9.6873
6 2020-03-29 00:00:00 -13.18940 12.4093 6.58190 37.9391
7 2020-04-05 00:00:00 -8.49098 27.8407 4.69842 15.4314
8 2020-04-12 00:00:00 -19.05360 20.0445 -10.56262 -7.7962
9 2020-04-26 00:00:00 -25.61330 31.6306 -6.55970 11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557 -20.47920 -61.9863
From InputFile.csv, I would like to find all dates that fall on a Sunday and extract the first occurrence of each Sunday (the earliest entry on that day, ignoring the later times), along with the X and Y values that correspond to it. These rows should go into a new dataframe where I can do the subtraction on X and Y. For the first row, X_Diff and Y_Diff are simply copies of X and Y. For every following row, X_Diff is the current X minus the previous X, and the same goes for Y_Diff, until the end of the file.
Here is my solution.
1. Preparation: I will need to generate some random data to be worked on.
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
The data is like this:
Date X Y
0 2020-02-13 00:00:00 -12.044751 165.962038
1 2020-02-13 01:00:00 63.537406 65.137176
2 2020-02-13 02:00:00 67.555256 114.186898
... ... ... ..
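2. Filter the dataframe to keep Sundays only. Then, generate another column with the date only, for grouping purposes. Note that dayofweek is 0 for Monday and 6 for Sunday; the sample output in the rest of this answer was generated with == 0, so the dates it lists are Mondays.
df = df[df.Date.dt.dayofweek == 6].copy()
df['date_only'] = df.Date.dt.date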
Then, it looks like this.
Date X Y date_only
96 2020-02-17 00:00:00 26.632391 120.311315 2020-02-17
97 2020-02-17 01:00:00 -14.111209 21.543440 2020-02-17
98 2020-02-17 02:00:00 -11.941086 -51.303122 2020-02-17
99 2020-02-17 03:00:00 -48.612563 137.023917 2020-02-17
100 2020-02-17 04:00:00 133.843010 -47.168805 2020-02-17
... ... ... ... ...
1796 2020-04-27 20:00:00 -158.310600 30.149292 2020-04-27
1797 2020-04-27 21:00:00 170.212825 181.626611 2020-04-27
1798 2020-04-27 22:00:00 59.773796 11.262186 2020-04-27
1799 2020-04-27 23:00:00 -99.757428 83.529157 2020-04-27
1944 2020-05-04 00:00:00 -168.435315 245.884281 2020-05-04
3. Next step, sort the data frame by "Date". Then, group the dataframe by "date_only". After that, take the first row of each group.
df = df.sort_values(by=['Date'])
df = df.groupby('date_only').apply(lambda g: g.head(1)).reset_index(drop=True).drop(columns=['date_only'])
Results:
Date X Y
0 2020-02-17 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274
2 2020-03-02 -231.596763 -46.989246
3 2020-03-09 76.561269 -40.188202
4 2020-03-16 -18.653363 52.376442
5 2020-03-23 106.758484 22.969963
6 2020-03-30 -133.601545 185.561830
7 2020-04-06 -57.748555 -187.878427
8 2020-04-13 57.648834 10.365917
9 2020-04-20 -47.959093 177.455676
10 2020-04-27 -30.527067 -37.046330
11 2020-05-04 -52.854252 -136.069205
4. Last step, get the difference for each X/Y value with their previous value.
df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()
Results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 NaN NaN
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.
df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)
Final results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
Note: There is no time displayed in the "Date" field in the final result. This is because the data I generated is hourly, so the first row of each selected day is XXXX-XX-XX 00:00:00, and pandas omits the time from the display when every value in the column is exactly midnight. The time component is still there.
Here is the Colab Link. You can have all my code in a notebook here.
https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing
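The whole pipeline can also be condensed into a few lines. The sketch below is not the notebook's code: it uses made-up, deterministic data (X simply counts the hours) so that the result is reproducible, and it filters on dayofweek == 6, which is Sunday in pandas.

```python
import pandas as pd

# Hypothetical hourly data: X counts hours since the start, Y is its negative.
dates = pd.date_range("2020-02-13", "2020-03-02", freq="60min")
df = pd.DataFrame({"Date": dates, "X": range(len(dates))})
df["Y"] = -df["X"]

# Keep Sundays only (Monday is 0, Sunday is 6) and sort chronologically.
sundays = df[df["Date"].dt.dayofweek == 6].sort_values("Date")

# First row of each calendar day.
first = sundays.groupby(sundays["Date"].dt.date).head(1).reset_index(drop=True)

# Difference to the previous Sunday; the first row keeps its own value.
first["X_Diff"] = first["X"].diff().fillna(first["X"])
first["Y_Diff"] = first["Y"].diff().fillna(first["Y"])
print(first)
```

With real data, first["Date"].dt.day_name() is a quick way to confirm that the filter picked the intended weekday.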
I will create a dataframe as Christopher did:
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
Dataframe view
First, set the datetime column as the index:
df = df.set_index('Date')
Second, keep only the rows for Sundays:
sunday_df= df[df.index.dayofweek == 6]
Third, resample to daily frequency, take the first value of each day (the question asks for the first entry of each Sunday; use .last() if you want the last one instead), and drop the empty non-Sunday days that resampling introduces:
sunday_df = sunday_df.resample('D').first().dropna()
Lastly, do the subtraction:
sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
The last view of the new dataframe
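To see what the choice of aggregation does, here is a small sketch on made-up data in which X counts the hours since the start, so the first and last entries of a day are easy to tell apart. The names first_rows and last_rows are illustrative.

```python
import pandas as pd

# Hypothetical hourly data indexed by datetime; X is the hour counter.
idx = pd.date_range("2020-02-13", "2020-03-02", freq="60min")
df = pd.DataFrame({"X": range(len(idx))}, index=idx)

sunday_df = df[df.index.dayofweek == 6]

# resample('D') creates one bin per calendar day, including the non-Sundays
# in between, which come out as NaN and are removed by dropna().
first_rows = sunday_df.resample("D").first().dropna()   # 00:00 entry of each Sunday
last_rows = sunday_df.resample("D").last().dropna()     # 23:00 entry of each Sunday

print(first_rows)
print(last_rows)
```

Note that after resampling the index holds the day (midnight), not the original timestamp of the selected row.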
Good evening,
is it possible to do a calculation on, let's say, two columns of a dataframe and add a third column with the result?
Dataframe (original):
name time_a time_b
name_a 08:00:00 09:00:00
name_b 07:45:00 08:15:00
name_c 07:00:00 08:10:00
name_d 06:00:00 10:00:00
To be specific: is it possible to obtain the difference of two times (time_b - time_a) and store it in a new column (time_c) at the end of the dataframe?
Dataframe (new):
name time_a time_b time_c
name_a 08:00:00 09:00:00 01:00:00
name_b 07:45:00 08:15:00 00:30:00
name_c 07:00:00 08:10:00 01:10:00
name_d 06:00:00 10:00:00 04:00:00
Thanks and a good night!
If your columns are in datetime or timedelta format:
# New column is a timedelta object
df["time_c"] = (df["time_b"] - df["time_a"])
If your columns are in datetime.time format (which it appears they are):
import datetime

def time_diff(time_1, time_2):
    """Return the difference between two datetime.time values (time_2 - time_1)."""
    # Attach both times to the same (arbitrary) date so they can be subtracted.
    today = datetime.date.today()
    time_1 = datetime.datetime.combine(today, time_1)
    time_2 = datetime.datetime.combine(today, time_2)
    return time_2 - time_1

# Apply the function row by row
df["time_c"] = df[["time_a","time_b"]].apply(lambda arr: time_diff(*arr), axis=1)
Alternatively, you can convert to a timedelta by first converting to a string:
df["time_a"]=pd.to_timedelta(df["time_a"].astype(str))
df["time_b"]=pd.to_timedelta(df["time_b"].astype(str))
df["time_c"] = df["time_b"] - df["time_a"]
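As a runnable illustration of the string-to-timedelta route, here is a sketch that rebuilds the dataframe from the question with datetime.time values (an assumption about how the data is stored). The helper as_delta is hypothetical, and unlike the snippet above it leaves time_a and time_b untouched.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "name": ["name_a", "name_b", "name_c", "name_d"],
    "time_a": [datetime.time(8, 0), datetime.time(7, 45),
               datetime.time(7, 0), datetime.time(6, 0)],
    "time_b": [datetime.time(9, 0), datetime.time(8, 15),
               datetime.time(8, 10), datetime.time(10, 0)],
})


def as_delta(col):
    """Convert a column of datetime.time values to timedeltas since midnight."""
    # str(datetime.time(8, 0)) is '08:00:00', which pd.to_timedelta can parse.
    return pd.to_timedelta(col.astype(str))


df["time_c"] = as_delta(df["time_b"]) - as_delta(df["time_a"])
print(df)
```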
I have 7 columns of data, indexed by datetime (30 minutes frequency) starting from 2017-05-31 ending in 2018-05-25. I want to plot the mean of specific range of date (seasons). I have been trying groupby, but I can't get to group by specific range. I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (These are the ranges I need)
increment_rates_winter = df.loc['2017-08-30'].mean() - df.loc['2017-06-01'].mean()
increment_rates_spring = df.loc['2017-11-30'].mean() - df.loc['2017-09-01'].mean()
increment_rates_summer = df.loc['2018-02-28'].mean() - df.loc['2017-12-01'].mean()
increment_rates_fall = df.loc['2018-05-24'].mean() - df.loc['2018-03-01'].mean()
Concatenated them:
df_seasons =pd.concat([increment_rates_winter,increment_rates_spring,increment_rates_summer,increment_rates_fall],axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons in x and the means plotted for each column.
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
You can select a specific date range with a boolean mask, as shown below. Define the start and end however you like, and then take the mean of the selection:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
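Applied to the four ranges from the question, the per-season table could be built in one go. The sketch below assumes a DatetimeIndex and follows the question's own definition (mean on the last day of the range minus mean on the first day); the data is made up, with a single column "50" that grows by exactly 1 per day so the result is easy to check.

```python
import pandas as pd

# Hypothetical half-hourly data: column "50" equals the number of days since the start.
idx = pd.date_range("2017-05-31", "2018-05-25", freq="30min")
df = pd.DataFrame({"50": (idx - idx[0]).days.to_numpy(dtype=float)}, index=idx)

seasons = {
    "Winter": ("2017-06-01", "2017-08-30"),
    "Spring": ("2017-09-01", "2017-11-30"),
    "Summer": ("2017-12-01", "2018-02-28"),
    "Fall": ("2018-03-01", "2018-05-24"),
}

# df.loc['YYYY-MM-DD'] selects every row of that day; .mean() averages each column.
df_seasons = pd.DataFrame({
    name: df.loc[end].mean() - df.loc[start].mean()
    for name, (start, end) in seasons.items()
})
print(df_seasons)
```

The result has one row per data column and one column per season, which matches the layout of df_seasons in the question.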
How about transposing it, so that the seasons become the x axis and each column becomes a line:
df_seasons.T.plot()
Output:
I have the following df
lst = [[1548828606206000000, 1548840373139000000],
[1548841285708000000, 1548841458405000000],
[1548842198276000000, 1548843109519000000],
[1548844022821000000, 1548844934207000000],
[1548845431090000000, 1548845539219000000],
[1548845555332000000, 1548845846621000000],
[1548847176147000000, 1548851020030000000],
[1548851704053000000, 1548852256143000000],
[1548852436514000000, 1548855900767000000],
[1548856817770000000, 1548857162183000000],
[1548858736931000000, 1548858979032000000]]
df = pd.DataFrame(lst,columns =['start','end'])
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
and I would like to get the total duration of these events within each clock hour, based on the start and end times. For example, in my dummy df the first event starts at 06:10:06, so the 6th hour should be 60 mins (the maximum per hour) - 00:10:06 = 00:49:54. The 7th and 8th hours should be 1:00:00 each, as the end time is 09:26:13. The 9th hour should be 00:26:13 plus all the intervals in the following rows that overlap with the 9th hour: 09:41 to 09:44 is about 3 mins, and 09:56 to 10:00 is about 4 mins. So the total for the 9th hour should be roughly 26 + 3 + 4 mins, i.e. 00:32:28.
My initial approach was to merge start and end, add dummy points every 3rd row, upsample to 1S, get the difference between rows, and sum up only the actual rows. There must be a more pythonic way of doing this. Any hint would be great.
IIUC, something like this:
df.apply(lambda x: pd.to_timedelta(pd.Series(1, index=pd.date_range(x.start, x.end, freq='S'))
.groupby(pd.Grouper(freq='H')).count(), unit='S'), axis=1).sum()
Output:
2019-01-30 06:00:00 00:49:54
2019-01-30 07:00:00 01:00:00
2019-01-30 08:00:00 01:00:00
2019-01-30 09:00:00 00:32:28
2019-01-30 10:00:00 00:33:43
2019-01-30 11:00:00 00:40:24
2019-01-30 12:00:00 00:45:37
2019-01-30 13:00:00 00:45:01
2019-01-30 14:00:00 00:09:48
Freq: H, dtype: timedelta64[ns]
Or, to group by hour of day only (dropping the date), try:
df.apply(lambda r: pd.to_timedelta(pd.Series(1, index=pd.date_range(r.start, r.end, freq='S'))
.pipe(lambda x: x.groupby(x.index.hour).count()), unit='S'), axis=1)\
.sum()
Output:
6 00:49:54
7 01:00:00
8 01:00:00
9 00:32:28
10 00:33:43
11 00:40:24
12 00:45:37
13 00:45:01
14 00:09:48
dtype: timedelta64[ns]
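If building one row per second is too heavy for long intervals, the same totals can be computed arithmetically by clipping each interval to every clock hour it touches. This is a sketch of a different technique than the one above, not a drop-in replacement: the function name duration_per_hour is made up, only the first three intervals from the question are used, and the results are exact to the sub-second, so they can differ by up to a second or so from the whole-second counts shown above.

```python
import pandas as pd

# First three intervals from the question (nanoseconds since the epoch).
lst = [[1548828606206000000, 1548840373139000000],
       [1548841285708000000, 1548841458405000000],
       [1548842198276000000, 1548843109519000000]]
df = pd.DataFrame(lst, columns=["start", "end"])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])


def duration_per_hour(frame):
    """Sum, per clock hour, the part of each [start, end) interval inside that hour."""
    one_hour = pd.Timedelta(hours=1)
    totals = {}
    for start, end in zip(frame["start"], frame["end"]):
        # Beginning of the clock hour that contains `start`.
        hour = start.normalize() + pd.Timedelta(hours=start.hour)
        while hour < end:
            # Overlap of [start, end) with [hour, hour + 1h).
            overlap = min(end, hour + one_hour) - max(start, hour)
            totals[hour] = totals.get(hour, pd.Timedelta(0)) + overlap
            hour += one_hour
    return pd.Series(totals).sort_index()


per_hour = duration_per_hour(df)
print(per_hour)
```

A useful sanity check is that per_hour.sum() equals (df["end"] - df["start"]).sum(), since every part of every interval lands in exactly one hour.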