creating pandas-vectorized 'subtraction' table - python

I have a Series with a DatetimeIndex and an integer value. I want to make a table that shows the change in value from each time to all the other subsequent times.
Below is a visual representation of what I want. The gray and orange cells are irrelevant data.
I can't figure out a way to create this in a vectorized style inside pandas.
import random
import pandas as pd

z = pd.date_range(start='2018-12-01', periods=10, freq='H')
df = pd.DataFrame(random.sample(range(1, 100), 10), index=z, columns=['foo'])
I've tried things like:
df['foo'].sub(df['foo'].transpose())
But that doesn't work.
The output DataFrame could either have a MultiIndex (beforeTime, afterTime) or a single "beforeTime" index with a column for each possible "afterTime". I think they're equivalent, as I can use unstack() and related functions to get the shape I want.

I think you can use np.subtract.outer to calculate all the values and create the dataframe like:
df_output = pd.DataFrame(np.subtract.outer(df.foo, df.foo),
                         columns=df.index.time, index=df.index.time)
print (df_output.head())
00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 \
00:00:00 0 6 -7 -57 -33 3
01:00:00 -6 0 -13 -63 -39 -3
02:00:00 7 13 0 -50 -26 10
03:00:00 57 63 50 0 24 60
04:00:00 33 39 26 -24 0 36
06:00:00 07:00:00 08:00:00 09:00:00
00:00:00 -53 -28 5 17
01:00:00 -59 -34 -1 11
02:00:00 -46 -21 12 24
03:00:00 4 29 62 74
04:00:00 -20 5 38 50
You can use np.triu to zero out all the values shown in grey in your example, such as:
pd.DataFrame(np.triu(np.subtract.outer(df.foo, df.foo)), columns = ...)
Note: the .time is not necessary when creating columns= and index=; it was only used to make the printed dataframe easier to read when copied and pasted.


filter data based on time (HH:MM:SS)

I have the data set below:
Time      Value
09:15:00  25
10:15:00  45
09:15:00  32
10:15:00  36
09:15:00  56
10:15:00  78
I would like to create a separate dataframe for each time:
df0915:

Time      Value
09:15:00  25
09:15:00  32
09:15:00  56

df1015:

Time      Value
10:15:00  45
10:15:00  36
10:15:00  78
Any help?
You can use pandas.DataFrame.groupby with a list comprehension.
out = [d for _, d in df.groupby('Time')]
# Output :
print(out)
[ Time Value
0 09:15:00 25
2 09:15:00 32
4 09:15:00 56,
Time Value
1 10:15:00 45
3 10:15:00 36
5 10:15:00 78]
To access one of the dataframes, use out[0] or out[1], ....
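If positional access like out[0] feels fragile, the same groupby idea can key the pieces by their time label instead; a small sketch with the question's data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Time': ['09:15:00', '10:15:00'] * 3,
                   'Value': [25, 45, 32, 36, 56, 78]})

# Dict comprehension: map each group label to its sub-DataFrame.
groups = {t: d for t, d in df.groupby('Time')}

df0915 = groups['09:15:00']
df1015 = groups['10:15:00']
```

This way a new time in the data never shifts which index holds which dataframe.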
This will filter the dataframe:
df0915 = df[df['Time'] == '09:15:00']
df1015 = df[df['Time'] == '10:15:00']
print(df0915)
print(df1015)
Output:
Time Value
0 09:15:00 25
2 09:15:00 32
4 09:15:00 56
Time Value
1 10:15:00 45
3 10:15:00 36
5 10:15:00 78

remove [] from output values in a dataframe

I'm pulling data from google trends and the output values come out as follows:
date value
0 2017-01-01 03:00:00 [65]
1 2017-01-01 03:01:00 [66]
2 2017-01-01 03:02:00 [77]
3 2017-01-01 03:03:00 [64]
4 2017-01-01 03:04:00 [94]
I've trimmed what I don't need. My issue is that I need to remove the brackets and make the value column an int. I've tried the following:
result['value'].apply(lambda x: pd.Series(str(x).replace('[', '').replace(']', '')))
But I get the same output either way. Any thoughts or suggestions?
You can also do:
df['value'] = df['value'].explode()
Output:
date value
0 2017-01-01 03:00:00 65
1 2017-01-01 03:01:00 66
2 2017-01-01 03:02:00 77
3 2017-01-01 03:03:00 64
4 2017-01-01 03:04:00 94
You can do
df['value'] = df['value'].str[0]
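Since the asker wants an integer column, either extraction can be followed by an astype cast; a minimal sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01 03:00:00',
                                           '2017-01-01 03:01:00']),
                   'value': [[65], [66]]})

# Pull the single element out of each one-item list, then cast to int.
df['value'] = df['value'].str[0].astype(int)
```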

Filtering dataframe given a list of dates

I have the following dataframe:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
3 1999-10-05 12:00:00 53
4 1999-10-10 16:00:00 43
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
I have a datetime list that I get from tolist() in another dataframe.
[Timestamp('1999-10-01 00:00:00'),
Timestamp('1999-10-02 00:00:00'),
Timestamp('1999-10-24 00:00:00')]
The list's purpose is to filter the dataframe based on the dates it contains. The end result is:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
Where only 1st, 2nd and 24th Oct rows will appear in the dataframe.
What is the approach to do this? I have looked up and only see solution to filter between dates or a singular date.
Thank you.
If you want to compare Timestamps ignoring the time component, use Series.dt.normalize:
df1 = df[df['Date'].dt.normalize().isin(L)]
Or Series.dt.floor:
df1 = df[df['Date'].dt.floor('d').isin(L)]
To compare by date objects, you also need to convert the list to dates:
df1 = df[df['Date'].dt.date.isin([x.date() for x in L])]
print (df1)
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
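Putting the first option together with the question's sample data, a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['1999-10-01 12:00:00', '1999-10-01 16:00:00',
                            '1999-10-02 11:00:00', '1999-10-05 12:00:00',
                            '1999-10-10 16:00:00', '1999-10-24 07:00:00',
                            '1999-10-24 08:00:00']),
    'Site': [65, 21, 57, 53, 43, 33, 21]})
L = pd.to_datetime(['1999-10-01', '1999-10-02', '1999-10-24'])

# Strip the time-of-day, then membership-test against the list of dates.
df1 = df[df['Date'].dt.normalize().isin(L)]
print(df1)
```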

Pandas - Compute data from a column to another

Considering the following dataframe:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df.tail(6)
week extra_hours extra_hours_over
12 3 01:00:00 NaN
13 3 02:00:00 NaN
14 3 01:00:00 NaN
15 3 02:00:00 NaN
16 3 00:00:00 NaN
17 3 00:00:00 NaN
Now, in every week the maximum amount of extra_hours allowed is 4h, meaning I have to subtract 30-minute blocks from the extra_hours column and move them into the extra_hours_over column, so that in every week the total sum of extra_hours is at most 4h.
So, given the example dataframe, a possible solution (for week 3) would be like this:
week extra_hours extra_hours_over
12 3 01:00:00 00:00:00
13 3 01:30:00 00:30:00
14 3 00:30:00 00:30:00
15 3 01:00:00 01:00:00
16 3 00:00:00 00:00:00
17 3 00:00:00 00:00:00
I would need to aggregate total extra_hours per week, check in which days it passes 4h, and then randomly subtract half-hour chunks.
What would be the easiest/most direct way to achieve this?
Here goes one attempt at what you seem to be asking. The idea is simple, although the code is fairly verbose:
1) Create some helper variables (minutes, extra_minutes, total for the week).
2) Loop while any week's total is above 240 minutes, working on a temporary subset containing only those weeks.
3) In the loop, use np.random.choice to select a row to remove 30 min from.
4) Apply the changes to minutes and extra_minutes.
The code:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')

while not df[df['tot_time'] > 240].empty:
    mask = df[(df['minutes'] >= 30) & (df['tot_time'] > 240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
    df.loc[mask, 'minutes'] -= 30
    df.loc[mask, 'extra_minutes'] += 30
    df['tot_time'] = df.groupby('week')['minutes'].transform('sum')

df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes', 'extra_minutes'], axis=1).tail(6)
Out[1]:
week extra_hours extra_hours_over tot_time
12 3 00:30:00 00:30:00 240
13 3 01:30:00 00:30:00 240
14 3 00:30:00 00:30:00 240
15 3 01:30:00 00:30:00 240
16 3 00:00:00 00:00:00 240
17 3 00:00:00 00:00:00 240
Note: Because I am using np.random.choice, the same observation can be picked in more than one iteration, which will make that observation change by more than one 30-minute chunk.

matplotlib plots strange horizontal lines on graph

I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
# Data imported from Excel spreadsheet into DataFrame using openpyxl.
# Code omitted for ease of reading.
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']), tides['tide'], '-', xdate=True, label='Tides 2011', linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens; after the last point of day 1, the line disappears off to the right and then returns to plot the first point of the second day - but the data is plotted incorrectly on the y axis. This happens throughout the dataset. A printout shows that the data seems to be OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening and how can I correct it?
Thanks in advance.
Phil
Doh! Finally found the answer. The original workflow was quite complicated. I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range. This was then converted to a pandas DataFrame. The date-and-time variable was converted to datetime format using pandas' .to_datetime() function. And finally the data were plotted using matplotlib. As I was preparing the data to post to this forum (as suggested by rauparaha) and paring down the script to its bare essentials, I noticed that Day 1 data were plotted on 01 Jan 2011 but Day 2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are in mixed formats: the last date given is '2011-12-31' (i.e. year-month-day) but the 2nd date, representing 2nd Jan 2011, is '2011-02-01' (i.e. year-day-month).
So, it looks like I misunderstood how the pandas .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format attribute (default=False) and had assumed any problems would have been flagged up. But it seems pandas assumes dates are in a month-first format, unless they're not, in which case it switches to a day-first format. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
Thanks again for your suggestions. And apologies for any confusion.
Cheers.
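The fix described above can be sketched like this; the exact format string is an assumption based on the UK day-first dates in the question:

```python
import pandas as pd

# Ambiguous day/month strings: by default pandas tends to parse these
# month-first, silently switching interpretation when a value forces it.
raw = pd.Series(['01/01/2011 23:45:00', '02/01/2011 00:00:00'])

# An explicit format string removes the ambiguity (day-first here).
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')
```

With the explicit format, '02/01/2011' parses as 2nd January rather than 1st February.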
I have been unable to replicate your error but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
ydata = np.sin(np.linspace(0, 10, num=200))
time_index = pd.date_range(start='2000-01-01', periods=200, freq='15min')
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the following plot
