Say I create a fully random Dataframe using the following:
import datetime
from datetime import timedelta
from random import randrange

from pandas.util import testing

def random_date(start, end):
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

def rand_dataframe():
    df = testing.makeDataFrame()
    df['date'] = [random_date(datetime.date(2014, 3, 18), datetime.date(2014, 4, 1))
                  for x in range(df.shape[0])]
    df.sort_values(by=['date'], inplace=True)
    return df
df = rand_dataframe()
which results in the dataframe shown at the bottom of this post. I would like to plot my columns A, B, C and D using the timeseries visualization features in seaborn so that I get something along these lines:
How can I approach this problem? From what I read on this notebook, the call should be:
sns.tsplot(df, time="time", unit="unit", condition="condition", value="value")
but this seems to require that the dataframe is represented in a different way, with the columns somehow encoding time, unit, condition and value, which is not my case. How can I convert my dataframe (shown below) into this format?
Here is my dataframe:
date A B C D
2014-03-18 1.223777 0.356887 1.201624 1.968612
2014-03-18 0.160730 1.888415 0.306334 0.203939
2014-03-18 -0.203101 -0.161298 2.426540 0.056791
2014-03-18 -1.350102 0.990093 0.495406 0.036215
2014-03-18 -1.862960 2.673009 -0.545336 -0.925385
2014-03-19 0.238281 0.468102 -0.150869 0.955069
2014-03-20 1.575317 0.811892 0.198165 1.117805
2014-03-20 0.822698 -0.398840 -1.277511 0.811691
2014-03-20 2.143201 -0.827853 -0.989221 1.088297
2014-03-20 0.299331 1.144311 -0.387854 0.209612
2014-03-20 1.284111 -0.470287 -0.172949 -0.792020
2014-03-22 1.031994 1.059394 0.037627 0.101246
2014-03-22 0.889149 0.724618 0.459405 1.023127
2014-03-23 -1.136320 -0.396265 -1.833737 1.478656
2014-03-23 -0.740400 -0.644395 -1.221330 0.321805
2014-03-23 -0.443021 -0.172013 0.020392 -2.368532
2014-03-23 1.063545 0.039607 1.673722 1.707222
2014-03-24 0.865192 -0.036810 -1.162648 0.947431
2014-03-24 -1.671451 0.979238 -0.701093 -1.204192
2014-03-26 -1.903534 -1.550349 0.267547 -0.585541
2014-03-27 2.515671 -0.271228 -1.993744 -0.671797
2014-03-27 1.728133 -0.423410 -0.620908 1.430503
2014-03-28 -1.446037 -0.229452 -0.996486 0.120554
2014-03-28 -0.664443 -0.665207 0.512771 0.066071
2014-03-29 -1.093379 -0.936449 -0.930999 0.389743
2014-03-29 1.205712 -0.356070 -0.595944 0.702238
2014-03-29 -1.069506 0.358093 1.217409 -2.286798
2014-03-29 2.441311 1.391739 -0.838139 0.226026
2014-03-31 1.471447 -0.987615 0.201999 1.228070
2014-03-31 -0.050524 0.539846 0.133359 -0.833252
In the end, what I am looking for is an overlay of plots (one per column), where each of them looks as follows (note that different values of CI get different alpha values):
I don't think tsplot is going to work with the data you have. The assumptions it makes about the input data are that you've sampled the same units at each timepoint (although you can have missing timepoints for some units).
For example, say you measured blood pressure from the same people every day for a month, and then you wanted to plot the average blood pressure by condition (where maybe the "condition" variable is the diet they are on). tsplot could do this, with a call that would look something like sns.tsplot(df, time="day", unit="person", condition="diet", value="blood_pressure")
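For concreteness, here is a hypothetical sketch (made-up values, column names mirroring that call) of the long-form frame tsplot expects, with one row per (person, day) measurement:
import pandas as pd

example = pd.DataFrame({
    "day": [0, 0, 1, 1],
    "person": [1, 2, 1, 2],
    "diet": ["low-carb", "fasting", "low-carb", "fasting"],
    "blood_pressure": [120, 135, 118, 133],
})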
That scenario is different from having large groups of people on different diets and, each day, randomly sampling some from each group and measuring their blood pressure. From the example you gave, it seems like your data are structured like the latter.
However, it's not that hard to come up with a mix of matplotlib and pandas that will do what I think you want:
import numpy as np
import pandas as pd
import seaborn as sns

# Read in the data from the stackoverflow question
df = pd.read_clipboard().iloc[1:]

# Convert it to "long-form" or "tidy" representation
df = pd.melt(df, id_vars=["date"], var_name="condition")

# Plot the average value by condition and date
ax = df.groupby(["condition", "date"]).mean().unstack("condition").plot()

# Get a reference to the x-points corresponding to the dates and the colors.
# (The dates come off the clipboard as strings, so the lines are drawn at
# ordinal positions 0..n-1, which is what lets np.arange line up with them.)
x = np.arange(len(df.date.unique()))
palette = sns.color_palette()

# Calculate the 25th and 75th percentiles of the data
# and plot a translucent band between them
for cond, cond_df in df.groupby("condition"):
    low = cond_df.groupby("date").value.apply(np.percentile, 25)
    high = cond_df.groupby("date").value.apply(np.percentile, 75)
    ax.fill_between(x, low, high, alpha=.2, color=palette.pop(0))
This code produces:
Related
I have the following problem. Suppose I have a wide DataFrame consisting of three columns (mock example below). Essentially, it consists of three factors, A, B and C, for which I have certain values for each business day within a time range.
import pandas as pd
import numpy as np
index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022')
index = np.repeat(index_d, 3)
values = np.random.randn(3 * len(index_d))
columns_v = len(index_d) * ["A", "B", "C"]
df = pd.DataFrame()
df["x"] = np.asarray(index)
df["y"] = values
df["factor"] = columns_v
I would like to plot the business-weekly averages of the three factors over time. A business week goes from Monday to Friday. However, in the example above I start within a week and end within a week. That means the first weekly average consists only of the data points on the 5th, 6th and 7th of October, and similarly for the last week. Ideally, the output should have the form
import datetime as dt

dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date()
dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date()
dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date()
dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date()
d = 3 * [dt1, dt2, dt3, dt4]
values = np.random.randn(len(d))
factors = 4 * ["A", "B", "C"]
df_output = pd.DataFrame()
df_output["time"] = d
df_output["values"] = values
df_output["factors"] = factors
I can then plot the weekly averages using seaborn as a lineplot with hue. Important to note: the time value for each weekly average is always the last business day in that week (Friday, except for the last week, where it is a Thursday).
I was thinking of groupby. However, given that my real data is much larger and possibly contains some NaN, I'm not sure how to do it, in particular with regard to the random start/end points that need not be Monday/Friday.
Try as follows:
res = df.groupby([pd.Grouper(key='x', freq='W-FRI'), df.factor])['y'].mean() \
        .reset_index(drop=False)
res = res.rename(columns={'x': 'time', 'factor': 'factors', 'y': 'values'})
res['time'] = res.time.map(pd.merge_asof(df.x, res.time, left_on='x',
                                         right_on='time', direction='forward')
                             .groupby('time').last()['x']).astype(str)
print(res)
time factors values
0 2022-10-07 A 0.171228
1 2022-10-07 B -0.250432
2 2022-10-07 C -0.126960
3 2022-10-14 A 0.455972
4 2022-10-14 B 0.582900
5 2022-10-14 C 0.104652
6 2022-10-21 A -0.526221
7 2022-10-21 B 0.371007
8 2022-10-21 C 0.012099
9 2022-10-27 A -0.123510
10 2022-10-27 B -0.566441
11 2022-10-27 C -0.652455
Plot data:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(data=res, x='time', y='values', hue='factors')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.show()
Result:
Explanation
First, apply df.groupby. Grouping by factor is of course easy; for the dates we can use pd.Grouper with freq parameter set to W-FRI (each week through to Friday), and then we want to get the mean for column y (NaN values will just be ignored).
In the next step, let's use df.rename to rename the columns.
We are basically done now, except for the fact that pd.Grouper will use each Friday (even if it isn't present in the actual set). E.g.:
print(res.time.unique())
['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000'
'2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']
If you are OK with this, you can just start plotting (but see below). If you would like to get '2022-10-27' instead of '2022-10-28', we can combine Series.map applied to column time with pd.merge_asof, and perform another groupby to get last in column x. I.e. this will get us the closest match to each Friday within each week (so, in fact, just Friday in all cases except the last: 2022-10-27).
In either scenario, before plotting, make sure to turn the datetime values into strings: res['time'] = res['time'].astype(str)!
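To see what that merge_asof step does in isolation, here is a minimal sketch with made-up dates; direction='forward' snaps each observed date to the next bucket boundary, and the trailing groupby keeps the last observed date per bucket:
import pandas as pd

obs = pd.DataFrame({'x': pd.to_datetime(['2022-10-26', '2022-10-27'])})
buckets = pd.DataFrame({'time': pd.to_datetime(['2022-10-28'])})
m = pd.merge_asof(obs, buckets, left_on='x', right_on='time',
                  direction='forward')
print(m.groupby('time').last())  # maps 2022-10-28 -> 2022-10-27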
You can add a column with the calendar week:
df['week'] = df.x.dt.isocalendar().week
Get a mask for all the Fridays, and for the last day:
last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True
Get the actual dates:
last_days = df.x[last_of_week].unique()
Group by week and factor, take the mean:
res = df.groupby(['week', 'factor']).mean().reset_index()
Clean up:
res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)
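The result can then be plotted the same way as in the first answer; a minimal sketch, assuming the column names produced above:
import seaborn as sns
import matplotlib.pyplot as plt

# res has columns 'factor', 'y', and 'x' (the last business day per week)
sns.lineplot(data=res.assign(x=res['x'].astype(str)), x='x', y='y', hue='factor')
plt.show()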
I have a .txt file with three columns: time, ticker, price. The time is spaced in 15-second intervals. This is what it looks like uploaded to a Jupyter notebook and put into a pandas DataFrame.
time ticker price
0 09:30:35 EV 33.860
1 00:00:00 AMG 60.430
2 09:30:35 AMG 60.750
3 00:00:00 BLK 455.350
4 09:30:35 BLK 451.514
... ... ... ...
502596 13:00:55 TLT 166.450
502597 13:00:55 VXX 47.150
502598 13:00:55 TSLA 529.800
502599 13:00:55 BIDU 103.500
502600 13:00:55 ON 12.700
# NOTE: the first set of data has the data at market open for
# every other time point, so that's what the 00:00:00 is.
# It is only limited to the 09:30:35 data.
I need to create a function that takes an input (a ticker) and then creates a bar chart that displays the data with 5-minute ticks (the data is every 20 seconds, so every 15 points in time).
So far I've thought about separating the "mm" part of the hh:mm:ss to get just the minutes in another column, and then write a for loop that looks something like this:
for num in df['mm']:
    if num % 5 == 0:
        print('tick')
then somehow appending the "tick" to the "time" column for every 5 minutes of data (I'm not sure how I would do this), then using the time column as the index and only using data with the "tick" index in it (some kind of if statement). I'm not sure if this makes sense but I'm drawing a blank on this.
You should have a look at the built-in functions in pandas. In the following example I'm using a date + time format but it shouldn't be hard to convert one to the other.
Generate data
%matplotlib inline
import pandas as pd
import numpy as np
dates = pd.date_range(start="2020-04-01", periods=150, freq="20S")
df1 = pd.DataFrame({"date": dates,
                    "price": np.random.rand(len(dates))})
df2 = df1.copy()
df1["ticker"] = "a"
df2["ticker"] = "b"
df = pd.concat([df1,df2], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
Resample the time series every 5 minutes
Here you can try to see the output of
df1.set_index("date")\
   .resample("5T")\
   .first()\
   .reset_index()
where we keep just the first observation in each 5-minute bin (00:00, 00:05, 00:10, and so on). In general, to do the same for every ticker we need a groupby:
out = df.groupby("ticker")\
        .apply(lambda x: x.set_index("date")
                          .resample("5T")
                          .first()
                          .reset_index())\
        .reset_index(drop=True)
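On recent pandas versions the lambda can be avoided by chaining groupby and resample directly; a sketch of an equivalent formulation (selecting the price column keeps the output tidy):
out = (df.set_index("date")
         .groupby("ticker")["price"]
         .resample("5T")
         .first()
         .reset_index())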
Plot function
def plot_tick(data, ticker):
    ts = data[data["ticker"] == ticker].reset_index(drop=True)
    ts.plot(x="date", y="price", kind="bar", title=ticker);
plot_tick(out, "a")
Then you can improve the plot or, eventually, try to use plotly.
There's a sensor dataset, and the values in the value column need to be corrected based on one specific sensor R in the data. The values are directions in degrees (a 360-degree circle). The correction method is the formula below: for each individual sensor i, calculate the sum of sine/cosine differences with respect to the reference sensor, get the correction in degrees by taking the arctangent, and then subtract it from the original values. Vi(t) is the value of sensor i at time t, and VR(t) is the value of reference sensor R at time t.
date sensor value tag
0 2000-01-01 1 200 a
1 2000-01-02 1 200 a
...
7 2000-01-08 1 300 b
8 2000-01-02 2 202 c
9 2000-01-03 2 204 c
10 2000-01-04 2 206 c
I have tried a few things but am a little confused about how to complete this in a for loop.
The timestamps for the sensors do not match; an individual sensor may have more or fewer timestamps than the reference sensor.
I want to add an additional column to store the corrected values.
Below is the sample dataset I made. If I choose sensor 2 as the reference sensor to correct the other sensors' values, how can I do this in a Python loop? Thanks in advance!
import pandas as pd
sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=8),"sensor":[1,1,1,1,1,1,1,1],"value":[200,200,200,200,200,300,300,300],"tag":pd.Series(['a','b']).repeat(4)})
sensor2 = pd.DataFrame({"date": pd.date_range('1/2/2000', periods=10),"sensor":[2,2,2,2,2,2,2,2,2,2],"value":[202,204,206,208,220,250,300,320,280,260],"tag":pd.Series(['c','d']).repeat(5)})
sensor3 = pd.DataFrame({"date": pd.date_range('1/3/2000', periods=10),"sensor":[3,3,3,3,3,3,3,3,3,3],"value":[265,222,232,220,260,300,250,200,190,223],"tag":pd.Series(['e','f']).repeat(5)})
sensor4 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=11),"sensor":[4,4,4,4,4,4,4,4,4,4,4],"value":[206,203,210,253,237,282,320,232,255,225,262],"tag":pd.Series(['c']).repeat(11)})
sensordata = pd.concat([sensor1, sensor2, sensor3, sensor4]).reset_index(drop=True)
Here is an inelegant solution, using for loops and multiple merges. As an example, I use sensor4 to correct the remaining sensors. The correction formula was not 100% clear to me, so I interpreted it as adding the sine and the cosine.
import numpy as np

def data_correction(vi, vr):
    # i assume sin and cosine are summed?
    return vi - np.arctan(np.sum(np.sin(vi - vr) + np.cos(vi - vr), axis=0))

sensors = [sensor1, sensor2, sensor3]  # assuming you want to correct with sensor 4
sensorR = sensor4.copy()

for i in range(len(sensors)):
    # create temp dataframe, with merge on date, so that measurements line up
    temp = pd.merge(sensors[i], sensorR, how='inner', left_on='date', right_on='date')
    # do correction and assign to new column
    temp['value_corrected'] = data_correction(temp['value_x'], temp['value_y'])
    # add this column to the original sensor data
    sensors[i] = sensors[i].merge(temp[['date', 'value_corrected']], how='inner',
                                  left_on='date', right_on='date')
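One caveat worth flagging: the values are directions in degrees, while NumPy's trig functions work in radians. A variant of the helper that converts first might look like this (my assumption, since the question's formula was not fully specified):
import numpy as np

def data_correction_deg(vi, vr):
    # hypothetical variant: convert the degree differences to radians,
    # apply the trig functions, then express the offset back in degrees
    diff = np.deg2rad(vi - vr)
    offset = np.arctan(np.sum(np.sin(diff) + np.cos(diff)))
    return vi - np.rad2deg(offset)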
I am trying to take yearly max rainfall data for multiple years of data within one array. I understand how you would need to use a for loop to take the max of a single range; I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So I have 14616 data points from 1960-1965, not including 1965, which span 2 leap years: 1960 and 1964. The data are 3-hourly (8 points per day), so a leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved using a for loop as follows (just a straight copy-paste from theirs):
for i, d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs involved taking the average of every 600 lines in their data. I thought there might be a way to just modify this, but I couldn't figure out a way to make it work. If modifying this won't work, is there another way to loop it with the leap years taken into account, without completely cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year, taking advantage of the DatetimeIndex's partial-string indexing flexibility (on recent pandas versions, use df.loc['1960'] rather than df['1960'] for row selection).
>>> df['1960'].max()
0 29
dtype: int32
>>> df['1960'].mean()
0 15.501366
dtype: float64
>>>
>>> len(df['1960'])
2928
>>> len(df['1961'])
2920
>>> len(df['1964'])
2928
>>>
I just cobbled this together from the Time Series / Date functionality section of the docs. Given pandas' capabilities, this looks a bit naive and can probably be improved upon.
Like resampling (using the same DataFrame)
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
Food for thought:
Time-aware Rolling vs. Resampling
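A quick sketch of that difference on the same DataFrame: resampling aggregates into disjoint calendar bins, while time-aware rolling computes a sliding statistic at every timestamp.
yearly = df.resample('A').mean()     # one row per calendar year
rolling = df.rolling('365D').mean()  # one row per original timestamp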
Say I have a dataframe like this (notebook text version follows the image):
A is an arrival flight (landing); D is a departure flight (take-off).
Carrier and FltReg together identify a single aircraft that arrives at and departs from an airport, and comes back to the same airport after a few hours or days.
Acft is the type of aircraft.
The arrivals and departures need to be matched so that the resulting dataframe can be used for calculations and for drawing a Gantt chart (start time, i.e. arrival time, and end time, i.e. departure time: the time the flight is on the ground).
The data will normally continue for 7 days of flight schedules and many more carriers, about 3000 rows for 7 days, coming from a SQL Server database.
from io import StringIO
import pandas as pd
dfstr = StringIO(u"""
ID;Car;FltNo;Acft;FltReg;E_FltType;Rtg;STADDtTm;ArrDep
0;EK;376;77W;A6ECI;T/A;DXB-BKK-DXB;03/05/2017 12:50;A
1;EK;377;77W;A6ECI;T/A;DXB-BKK-DXB;03/05/2017 15:40;D
2;EK;384;380;A6EDL;T/S;DXB-BKK-HKG;02/05/2017 12:15;A
3;EK;384;380;A6EDL;T/S;DXB-BKK-HKG;02/05/2017 14:00;D
4;EK;385;380;A6EDL;T/A;HKG-BKK-DXB;02/05/2017 23:45;A
5;EK;385;380;A6EDL;T/A;HKG-BKK-DXB;03/05/2017 01:15;D
54;VZ;920;320;HSVKA;DEP ONLY;BKK-HPH;01/05/2017 11:15;D
55;VZ;921;320;HSVKA;ARR ONLY;HPH-BKK;01/05/2017 15:25;A
56;VZ;602;320;HSVKA;DEP ONLY;BKK-CNX;01/05/2017 16:35;D
57;VZ;603;320;HSVKA;ARR ONLY;CNX-BKK;01/05/2017 19:45;A
58;VZ;602;320;HSVKA;DEP ONLY;BKK-CNX;02/05/2017 11:15;D
59;VZ;603;320;HSVKA;ARR ONLY;CNX-BKK;02/05/2017 14:25;A
60;VZ;820;320;HSVKA;DEP ONLY;BKK-HKT;03/05/2017 07:05;D
61;VZ;821;320;HSVKA;ARR ONLY;HKT-BKK;03/05/2017 15:45;A
62;VZ;828;320;HSVKA;DEP ONLY;BKK-HKT;03/05/2017 18:20;D
63;VZ;829;320;HSVKA;ARR ONLY;HKT-BKK;03/05/2017 21:50;A
64;VZ;600;320;HSVKB;DEP ONLY;BKK-CNX;01/05/2017 06:10;D
65;VZ;601;320;HSVKB;ARR ONLY;CNX-BKK;01/05/2017 09:20;A
66;VZ;606;320;HSVKB;DEP ONLY;BKK-CNX;01/05/2017 09:50;D
67;VZ;607;320;HSVKB;ARR ONLY;CNX-BKK;01/05/2017 13:00;A
""")
df = pd.read_csv(dfstr, sep=";", index_col='ID')
df
Question 1: How to convert the above dataframe to the one below.
I want rows combined when Car and FltReg are the same... e.g. ID 0, EK376 A6ECI arrives at 03 May 12:50 and departs as ID 1, EK377 A6ECI at 03 May 15:40... similarly for IDs 2 and 3, and IDs 4 and 5... these are 3 different aircraft, as highlighted in bold, with many other flights in between... Next comes ID 54, a VZ carrier with aircraft reg HSVKA... it departs first, so it should be on its own row... then it arrives as ID 55 and departs as ID 56, and arrives again as ID 57 and departs as ID 58.
Here is how the resulting dataframe should look like:
from io import StringIO
import pandas as pd
dfstr = StringIO(u"""
IDArr;Car;FltNo;Acft;FltReg;E_FltType;Rtg;STADDtTm;ArrDep;IDDep;Car;FltNo;Acft;FltReg;E_FltType;Rtg;STADDtTm;ArrDep
0;EK;376;77W;A6ECI;T/A;DXB-BKK-DXB;03/05/2017 12:50;A;1;EK;377;77W;A6ECI;T/A;DXB-BKK-DXB;03/05/2017 15:40;D
2;EK;384;380;A6EDL;T/S;DXB-BKK-HKG;02/05/2017 12:15;A;3;EK;384;380;A6EDL;T/S;DXB-BKK-HKG;02/05/2017 14:00;D
4;EK;385;380;A6EDL;T/A;HKG-BKK-DXB;02/05/2017 23:45;A;5;EK;385;380;A6EDL;T/A;HKG-BKK-DXB;03/05/2017 01:15;D
;;;;;;;;;54;VZ;920;320;HSVKA;DEP ONLY;BKK-HPH;01/05/2017 11:15;D
55;VZ;921;320;HSVKA;ARR ONLY;HPH-BKK;01/05/2017 15:25;A;56;VZ;602;320;HSVKA;DEP ONLY;BKK-CNX;01/05/2017 16:35;D
57;VZ;603;320;HSVKA;ARR ONLY;CNX-BKK;01/05/2017 19:45;A;58;VZ;602;320;HSVKA;DEP ONLY;BKK-CNX;02/05/2017 11:15;D
59;VZ;603;320;HSVKA;ARR ONLY;CNX-BKK;02/05/2017 14:25;A;60;VZ;820;320;HSVKA;DEP ONLY;BKK-HKT;03/05/2017 07:05;D
61;VZ;821;320;HSVKA;ARR ONLY;HKT-BKK;03/05/2017 15:45;A;62;VZ;828;320;HSVKA;DEP ONLY;BKK-HKT;03/05/2017 18:20;D
63;VZ;829;320;HSVKA;ARR ONLY;HKT-BKK;03/05/2017 21:50;A;;;;;;;;;
;;;;;;;;;64;VZ;600;320;HSVKB;DEP ONLY;BKK-CNX;01/05/2017 06:10;D
65;VZ;601;320;HSVKB;ARR ONLY;CNX-BKK;01/05/2017 09:20;A;66;VZ;606;320;HSVKB;DEP ONLY;BKK-CNX;01/05/2017 09:50;D
67;VZ;607;320;HSVKB;ARR ONLY;CNX-BKK;01/05/2017 13:00;A;;;;;;;;;
""")
df2 = pd.read_csv(dfstr, sep=";")
df2
As you can see... ID 0 and ID 1 are matched in the same row... thus it is easier to see how long the flight is on the ground (that is, in the airport)... from 12:50 to 15:40 (2 hours 50 mins)... and so on for the rest of the flights.
Question 2: Make a Gantt chart with the above resulting dataframe
This resulting dataframe will then be used for generating Gantt charts.
That is, for example, aircraft HSVKA (VZ flight) will have its own row... with the 11:15 departure first (a bar drawn from 10:15 (1 hour before departure, as there is no arrival) to 11:15), then bars drawn in the same row for 15:25 to 16:35, 19:45 to 11:15 the next day, 14:25 to 07:05, 15:45 to 18:20, and 21:50 to 22:50 (one hour after arrival, as there is no departure). broken_barh of matplotlib comes to mind.
HSVKB will have its own row for the Gantt... and so on.
Each Carrier/Aircraft Reg on its own row in the visual.
Question 1
One quick change to your setup is that I didn't set ID as the index_col because I want to use its value quickly in a groupby().shift. So starting from that modified read_csv:
df = pd.read_csv(dfstr, sep=";")
cols = df.columns.values.tolist()
A big part of the solution is making sure the df is ordered by Car, FltReg, and STADDtTm (because the first two are the unique identifiers, and the last is the main sort value).
sort_cols = ['Car', 'FltReg', 'STADDtTm']
df.sort_values(by=sort_cols, inplace=True)
So now we're at the main part of the logic. I'm going to separate df into arrivals and departures, and the way the two are going to be joined is by a shifted ID. That is, for any (Car, FltReg) partition, I know to pair a given 'A' row with the 'D' row immediately after it. So again, this is why we need sorted (and complete) data.
Let's generate that shifted ID:
# sort_cols[:2] is `Car` and `FltReg` together
df['NextID'] = df.groupby(sort_cols[:2])['ID'].shift(1)
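To illustrate with a made-up fragment: within the HSVKA partition, shifting IDs [55, 56, 57, 58] down by one yields NextID [NaN, 55, 56, 57], so each 'D' row now carries the ID of the 'A' row directly above it.
demo = pd.DataFrame({'Car': ['VZ'] * 4, 'FltReg': ['HSVKA'] * 4,
                     'ID': [55, 56, 57, 58]})
demo['NextID'] = demo.groupby(['Car', 'FltReg'])['ID'].shift(1)
print(demo)  # NextID: NaN, 55.0, 56.0, 57.0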
Now using an 'A'-filtered df and a 'D'-filtered df, I am going to full-outer-join them together. Arrivals (left dataset) are keyed by the original ID, and departures (right dataset) are keyed by the NextID we just made.
df_display = df[df['ArrDep'] == 'A'] \
    .merge(df[df['ArrDep'] == 'D'],
           how='outer',
           left_on='ID',
           right_on='NextID',
           suffixes=('1', '2'))
Note that the columns will now be suffixed with 1 (left) and 2 (right).
At this point, this new dataframe df_display has all the rows it needs, but it doesn't have the nice sort in your final display. To accomplish this, you need the sort_cols list again, but with coalesced versions of each column that combine the respective left and right versions. For example, Car1 and Car2 have to be coalesced together so that you can sort all rows by the combined version.
pandas' combine_first is like coalesce.
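A one-line reminder of its semantics, with made-up values (nulls in the caller are filled from the argument):
pd.Series([1.0, None]).combine_first(pd.Series([9.0, 9.0]))  # -> [1.0, 9.0]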
# purely for sorting the final display
for c in sort_cols:
    df_display['sort_' + c] = df_display[c + '1'] \
        .combine_first(df_display[c + '2'])
# for example, Car1 and Car2 have now been coalesced into sort_Car

df_display.sort_values(by=['sort_{}'.format(c) for c in sort_cols], inplace=True)
We're almost done. Now df_display has extraneous columns that we don't need. We can select only the columns we want—basically, two copies of the original column list cols.
df_display = df_display[['{}1'.format(c) for c in cols] + ['{}2'.format(c) for c in cols]]
df_display.to_csv('output.csv', index=None)
I checked (in a csv export so that we could see the wide dataset) that this matches your sample.
Question 2
Okay, so if you play around with the code at https://matplotlib.org/examples/pylab_examples/broken_barh.html, you can see how broken_barh operates. This is important, as we have to make the data fit this structure to be able to use it. broken_barh's first argument is a list of tuples to plot, and each tuple is a (start time, duration).
For matplotlib, the start time has to be in its special date format. So we have to convert pandas datetimes using matplotlib.dates.date2num. Finally, the duration seems like it's in day units.
Thus, if HSVKA arrives at 2017-05-01 15:25:00 and is on the ground for 70 minutes, then broken_barh needs to plot the tuple (mdates.date2num(Timestamp('2017-05-01 15:25:00')), 70 minutes in day units, or 0.04861).
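A minimal sketch of that conversion; the duration is easy to check by hand (70 minutes / 1440 minutes per day ≈ 0.04861):
import matplotlib.dates as mdates
import pandas as pd

start = mdates.date2num(pd.Timestamp('2017-05-01 15:25:00'))  # float day number
duration = 70 / (24 * 60)   # 70 minutes in day units
bar = (start, duration)     # one (start, duration) tuple for broken_barh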
So the first step is getting df_display from Question 1 in this format. We only need to focus on the four columns 'Car1', 'FltReg1', 'STADDtTm1', 'STADDtTm2' now.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn # optional ... I like the look
print(df_display[['Car1', 'FltReg1', 'STADDtTm1', 'STADDtTm2']])
which looks like
Car1 FltReg1 STADDtTm1 STADDtTm2
0 EK A6ECI 03/05/2017 12:50 03/05/2017 15:40
1 EK A6EDL 02/05/2017 12:15 02/05/2017 14:00
2 EK A6EDL 02/05/2017 23:45 03/05/2017 01:15
10 NaN NaN NaN 01/05/2017 11:15
3 VZ HSVKA 01/05/2017 15:25 01/05/2017 16:35
4 VZ HSVKA 01/05/2017 19:45 02/05/2017 11:15
5 VZ HSVKA 02/05/2017 14:25 03/05/2017 07:05
6 VZ HSVKA 03/05/2017 15:45 03/05/2017 18:20
7 VZ HSVKA 03/05/2017 21:50 NaN
11 NaN NaN NaN 01/05/2017 06:10
8 VZ HSVKB 01/05/2017 09:20 01/05/2017 09:50
9 VZ HSVKB 01/05/2017 13:00 NaN
There are NaNs when an arrival or departure is missing. Imputing these is fairly straightforward. I noticed in your write-up that you wanted one-hour buffers on either side when something is missing. So here's all of that straightforward wrangling:
df_gantt = df_display.copy()

# Convert to pandas timestamps for date arithmetic
df_gantt['STADDtTm1'] = pd.to_datetime(df_gantt['STADDtTm1'],
                                       format='%d/%m/%Y %H:%M')
df_gantt['STADDtTm2'] = pd.to_datetime(df_gantt['STADDtTm2'],
                                       format='%d/%m/%Y %H:%M')

# Impute identifiers
df_gantt['Car'] = df_gantt['Car1'].combine_first(df_gantt['Car2'])
df_gantt['FltReg'] = df_gantt['FltReg1'].combine_first(df_gantt['FltReg2'])

# Also just gonna combine Car and FltReg
# into a single column for simplicity
df_gantt['Car_FltReg'] = df_gantt['Car'] + ': ' + df_gantt['FltReg']

# Impute hour gaps
df_gantt['STADDtTm1'] = df_gantt['STADDtTm1'] \
    .fillna(df_gantt['STADDtTm2'] - pd.Timedelta('1 hour'))
df_gantt['STADDtTm2'] = df_gantt['STADDtTm2'] \
    .fillna(df_gantt['STADDtTm1'] + pd.Timedelta('1 hour'))

# Date diff in day units (total_seconds is safer than .dt.seconds,
# which would drop whole days for stays longer than 24 hours)
df_gantt['DayDiff'] = (df_gantt['STADDtTm2'] - df_gantt['STADDtTm1']) \
    .dt.total_seconds() / (60 * 60 * 24)

# matplotlib numeric date format
df_gantt['STADDtTm1'] = df_gantt['STADDtTm1'].apply(mdates.date2num)
df_gantt['STADDtTm2'] = df_gantt['STADDtTm2'].apply(mdates.date2num)

df_gantt = df_gantt[['Car_FltReg', 'STADDtTm1', 'STADDtTm2', 'DayDiff']]
print(df_gantt)
which now looks like
Car_FltReg STADDtTm1 STADDtTm2 DayDiff
0 EK: A6ECI 736452.534722 736452.652778 0.118056
1 EK: A6EDL 736451.510417 736451.583333 0.072917
2 EK: A6EDL 736451.989583 736452.052083 0.062500
10 VZ: HSVKA 736450.427083 736450.468750 0.041667
3 VZ: HSVKA 736450.642361 736450.690972 0.048611
4 VZ: HSVKA 736450.822917 736451.468750 0.645833
5 VZ: HSVKA 736451.600694 736452.295139 0.694444
6 VZ: HSVKA 736452.656250 736452.763889 0.107639
7 VZ: HSVKA 736452.909722 736452.951389 0.041667
11 VZ: HSVKB 736450.215278 736450.256944 0.041667
8 VZ: HSVKB 736450.388889 736450.409722 0.020833
9 VZ: HSVKB 736450.541667 736450.583333 0.041667
Now make a dict where each key is a unique Car_FltReg and each value is a list of tuples (as described earlier) that can be fed into broken_barh.
dict_gantt = df_gantt.groupby('Car_FltReg')[['STADDtTm1', 'DayDiff']] \
    .apply(lambda x: list(zip(x['STADDtTm1'].tolist(),
                              x['DayDiff'].tolist()))) \
    .to_dict()
So dict_gantt looks like
{'EK: A6ECI': [(736452.5347222222, 0.11805555555555557)],
'EK: A6EDL': [(736451.5104166666, 0.07291666666666667),
(736451.9895833334, 0.0625)],
'VZ: HSVKA': [(736450.4270833334, 0.041666666666666664),
(736450.6423611111, 0.04861111111111111),
(736450.8229166666, 0.6458333333333334),
(736451.6006944445, 0.6944444444444445),
(736452.65625, 0.1076388888888889),
(736452.9097222222, 0.041666666666666664)],
'VZ: HSVKB': [(736450.2152777778, 0.041666666666666664),
(736450.3888888889, 0.020833333333333332),
(736450.5416666666, 0.041666666666666664)]}
Perfect for broken_barh. And now it's all the matplotlib craziness. After the core logic to prepare for broken_barh stuff, everything else is just the painstaking tick formatting, etc. If you've customized something in matplotlib, this stuff should be familiar—I won't explain much of it.
FltReg_list = sorted(dict_gantt, reverse=True)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
start_datetime = df_gantt['STADDtTm1'].min()
end_datetime = df_gantt['STADDtTm2'].max()
# parameters for yticks, etc.
# you might have to play around
# with the different parts to modify
n = len(FltReg_list)
bar_size = 9
for i, bar in enumerate(FltReg_list):
    ax.broken_barh(dict_gantt[bar],          # data
                   (10 * (i + 1), bar_size), # (y position, bar size)
                   alpha=0.75,
                   edgecolor='k',
                   linewidth=1.2)
# I got date formatting ideas from
# https://matplotlib.org/examples/pylab_examples/finance_demo.html
ax.set_xlim(start_datetime, end_datetime)
ax.xaxis.set_major_locator(mdates.HourLocator(byhour=range(0, 24, 6)))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d %H:%M'))
ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=range(0, 24, 1)))
# omitting minor labels ...
plt.grid(True, which='minor', color='w', linestyle='dotted')
ax.set_yticks([5 + 10 * k for k in range(1, n + 1)])
ax.set_ylim(5, 5 + 10 * (n + 1))
ax.set_yticklabels(FltReg_list)
ax.set_title('Time on Ground')
ax.set_ylabel('Carrier: Registration')
plt.setp(plt.gca().get_xticklabels(), rotation=30, horizontalalignment='right')
plt.tight_layout()
fig.savefig('gantt.png', dpi=200)
Here's the final output.