Plotting irregular time-series (multiple) from dataframe using ggplot - python

I have a df structured as so:
UnitNo Time Sensor
0 1.0 2016-07-20 18:34:44 19.0
1 1.0 2016-07-20 19:27:39 19.0
2 3.0 2016-07-20 20:45:39 17.0
3 3.0 2016-07-20 23:05:29 17.0
4 3.0 2016-07-21 01:23:30 11.0
5 2.0 2016-07-21 04:23:59 11.0
6 2.0 2016-07-21 17:33:29 2.0
7 2.0 2016-07-21 18:55:04 2.0
I want to create a time-series plot where each UnitNo has its own line (color) and the y-axis values correspond to Sensor and the x-axis is Time. I want to do this in ggplot, but I am having trouble figuring out how to do this efficiently. I have looked at previous examples but they all have regular time series, i.e., observations for each variable occur at the same times which makes it easy to create a time index. I imagine I can loop through and add data to plot(?), but I was wondering if there was a more efficient/elegant way forward.

df.set_index('Time').groupby('UnitNo').Sensor.plot();

I think you need pivot or set_index and unstack with DataFrame.plot:
df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()
Or:
df.set_index(['Time', 'UnitNo'])['Sensor'].unstack().plot()
If some duplicates:
df.groupby(['Time', 'UnitNo'])['Sensor'].mean().unstack().plot()
df.pivot_table(index='Time', columns='UnitNo', values='Sensor', aggfunc='mean').plot()
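For reference, a minimal, self-contained sketch using the sample frame above (pandas/matplotlib rather than ggplot; column names as shown in the question):

import pandas as pd
import matplotlib.pyplot as plt

# rebuild the sample frame from the question
df = pd.DataFrame({
    'UnitNo': [1.0, 1.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0],
    'Time': pd.to_datetime(['2016-07-20 18:34:44', '2016-07-20 19:27:39',
                            '2016-07-20 20:45:39', '2016-07-20 23:05:29',
                            '2016-07-21 01:23:30', '2016-07-21 04:23:59',
                            '2016-07-21 17:33:29', '2016-07-21 18:55:04']),
    'Sensor': [19.0, 19.0, 17.0, 17.0, 11.0, 11.0, 2.0, 2.0],
})

# one line per UnitNo; the irregular timestamps are fine because each unit's
# column simply holds NaN at the times where that unit was not observed
df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()
plt.show()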

Related

Graphing a group of data as a density function

The .csv data looks like this:
man birthyear claim number_items_reported Impression age_group
0 1977 110.0 1 2.0 older_adult
0 1987 12.0 1 2.0 adult
1 1982 628.0 1 0.0 adult
1 1968 503.6 1 0.0 older_adult
1 1980 807.8 3 2.0 older_adult
I have grouped the data by the relevant criteria to get the mean for four outcomes with df.groupby. I am trying to show the results graphically, with the mean data (from the df.groupby call below) on the y-axis and number_items_reported on the x-axis. I appreciate your reply.
I have done the following but keep getting an error:
import pandas as pd
# Use GroupBy() & compute claim mean for each impression
impression_group = df.groupby('Impression')['claim'].mean()
print(impression_group)
Returning:
Impression
0.0 911.253743
1.0 866.242697
2.0 791.260000
3.0 818.035949
Name: claim, dtype: float64
Then I enter the commands for the graph:
#define index column
df.set_index('number_items_reported', inplace=True)
#group data by Impression and display mean claims for each Impression as line chart
df.groupby('Impression')['claim'].plot(legend=True)
Returning:
KeyError: "None of ['number_items_reported'] are in the columns"
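A likely cause (an assumption, since the full script isn't shown): set_index('number_items_reported', inplace=True) had already been run once, so the column has been moved into the index, and running the cell again then raises exactly this KeyError. A hedged sketch that avoids mutating df and plots the mean claim against number_items_reported, one line per Impression:

import matplotlib.pyplot as plt

# mean claim for every (Impression, number_items_reported) pair,
# reshaped so each Impression becomes its own line
means = (df.groupby(['Impression', 'number_items_reported'])['claim']
           .mean()
           .unstack('Impression'))
means.plot(legend=True)
plt.xlabel('number_items_reported')
plt.ylabel('mean claim')
plt.show()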

Not possible to give column names to concatenated Pandas Series

From a pandas dataframe I calculate mean(), std() and max() of all variables with pandas' built-in functions. I get back three pandas Series.
import pandas as pd
df_FALKO_R_scores_mean = df_FALKO_R_scores_only.mean()
df_FALKO_R_scores_sd = df_FALKO_R_scores_only.std()
df_FALKO_R_scores_max = df_FALKO_R_scores_only.max()
Then I concatenate the three Series to get an output of mean, sd and max for every variable.
The problem is, as you can see below, that although I pass "names" to the concat() function, the columns are labelled 0, 1 and 2. This is not readable, especially if I want to plot those numbers. How can I get a result with the column labels ['mean', 'sd', 'max']? I also tried "ignore_index" True and False.
df_FALKO_R_scores_mean_sd_max = pd.concat([df_FALKO_R_scores_mean, df_FALKO_R_scores_sd, df_FALKO_R_scores_max], names=['mean', 'sd', 'max'], axis=1, ignore_index=True)
print(df_FALKO_R_scores_mean_sd_max)
Output:
0 1 2
R_fd_s_01a_s 1.026490 0.631897 2.0
R_fd_e_01b_s 0.794702 0.802645 2.0
R_fd_e_01c_s 1.039735 1.124757 4.0
R_fd_p_02a_s 1.390728 0.848320 3.0
R_fd_p_02b_s 0.880795 0.552897 2.0
R_fd_p_03_s 1.132450 1.004493 3.0
R_fd_s_04_s 0.834437 0.769679 2.0
R_fd_e_05_s 0.403974 0.694539 2.0
R_fd_p_06a_s 1.105960 0.644488 2.0
R_fd_e_06b_s 1.337748 0.979030 3.0
R_fd_e_07_s 1.192053 1.320178 4.0
R_fd_e_08a_s 0.748344 0.741337 2.0
R_fd_e_08b_s 0.529801 0.737635 2.0
R_fd_p_09a_s 1.688742 1.312430 4.0
R_fd_p_09b_s 0.701987 0.839005 3.0
R_fw_01_s 0.774834 0.731867 2.0
R_fw_02_s 0.761589 0.797568 2.0
R_fw_03_s 0.841060 0.857070 2.0
R_fw_04_s 0.589404 0.675983 2.0
R_fw_05_s 0.403974 0.655020 2.0
R_fw_06_s 0.211921 0.441351 2.0
R_fw_07_s 0.536424 0.789724 2.0
R_fw_08_s 0.927152 0.566855 2.0
R_fw_09a_s 1.317881 0.843571 2.0
Thanks for any help!
Why don't you use agg() instead of creating three different calculations and concatenating the results?
df_FALKO_R_scores_only.agg(['mean', 'std', 'max']).T
It will give you results with proper column names.
You didn't include any sample input, but I believe it should work in this case.
EDIT:
If you want to use pd.concat, you can name each Series, for example:
df_FALKO_R_scores_mean.name = 'mean'
Or you can just name output columns by using a list.
df_FALKO_R_scores_mean_sd_max.columns = ['mean', 'std', 'max']
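For completeness, a minimal sketch of the concat route using the Series from the question. Two details matter here: names= only labels the levels of a hierarchical index created via keys=, not the columns themselves, and ignore_index=True resets the column labels to 0, 1, 2, so it has to be dropped:

import pandas as pd

# keys= labels the resulting columns; do not pass ignore_index=True,
# otherwise the labels are replaced by 0, 1, 2 again
df_FALKO_R_scores_mean_sd_max = pd.concat(
    [df_FALKO_R_scores_mean, df_FALKO_R_scores_sd, df_FALKO_R_scores_max],
    keys=['mean', 'sd', 'max'], axis=1)
print(df_FALKO_R_scores_mean_sd_max)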

Seaborn: How to handle gap between historic and forecasted values?

I have a problem explaining a gap between the historic data and the forecast.
The blue line is the historic data, and the orange line is the lin-lin regression forecast with future values.
Dataframe df is the training dataset with columns year, pax, RealGDPLP.
Dataframe FutureValCPs has the columns year and RealGDPLP.
How do you explain that it is not continuous (in other cases it is)?
The OLS results are attached. Anything which gives an indication?
Thank you.
With no data, no code and no details about the graphical engine used to produce your plot, it's going to be hard to be absolutely certain. But your forecasts seem perfectly fine compared to your historical data, in that they at the very least predict a smooth future increase in your values. If the blue line represents your entire dataset, there's really not much more that can be said using OLS.
The reason there's a gap in your plot is that the two lines are two different series and don't share a common timestamp at the transition between historical and forecasted values. There are ways to visually remedy this, but as I've mentioned, I have no idea how you've estimated the model or produced this plot.
Edit: Extended answer based on more information from OP:
This should resemble your issue with regards to the plot:
I'm assuming that the following dataframe will represent your situation:
historic forecast
dates
2020-01-01 1.0 NaN
2020-01-02 2.0 NaN
2020-01-03 3.0 NaN
2020-01-04 3.0 NaN
2020-01-05 6.0 NaN
2020-01-06 4.0 NaN
2020-01-07 8.0 NaN
2020-01-08 NaN 6.0
2020-01-09 NaN 7.0
2020-01-10 NaN 8.0
2020-01-11 NaN 9.0
2020-01-12 NaN 10.0
2020-01-13 NaN 11.0
2020-01-14 NaN 12.0
And I think this is a perfectly natural situation for series of historic and forecasted values; there's no reason why there should not be a visual gap between them. Now, one way to visually remedy this could be to include the forecasted value of 6.0 at index 2020-01-08 for the historic series, or the historic value of 8.0 at index 2020-01-07 for the forecasts. You can do so using df.loc['2020-01-07', 'forecast'] = 8.0 or df.loc['2020-01-08', 'historic'] = 6.0 (label-based assignment avoids the chained-indexing warning). This can of course be done more smoothly by programmatically determining the inserted value and the index. But here's the result either way:
Complete code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

df_historic = pd.DataFrame({'dates': pd.date_range("20200101", periods=7),
                            'historic': [1, 2, 3, 3, 6, 4, 8]}).set_index('dates')
df_forecast = pd.DataFrame({'dates': pd.date_range("20200108", periods=7),
                            'forecast': [6, 7, 8, 9, 10, 11, 12]}).set_index('dates')
df = pd.merge(df_historic, df_forecast, how='outer', left_index=True, right_index=True)

# bridge the visual gap by repeating the first forecasted value at the end of
# the historic series (alternatively: df.loc['2020-01-07', 'forecast'] = 8.0)
df.loc['2020-01-08', 'historic'] = 6.0

for column in df.columns:
    g = sns.lineplot(x=df.index, y=df[column])
g.set_xticklabels(labels=df.index, rotation=-20)
plt.show()
I hope this helps!

How to populate a column in one dataframe by comparing it to another dataframe

I have a dataframe called res_df:
In [54]: res_df.head()
Out[54]:
Bldg_Sq_Ft GEOID CensusPop HU_Pop Pop_By_Area
0 753.026123 240010013002022 11.0 7.0 NaN
7 95.890495 240430003022003 17.0 8.0 NaN
8 1940.862793 240430003022021 86.0 33.0 NaN
24 2254.519775 245102801012021 27.0 13.0 NaN
25 11685.613281 245101503002000 152.0 74.0 NaN
I have a second dataframe made from the summarized information in res_df. It's grouped by the GEOID column and then summarized using aggregations to get the sum of the Bldg_Sq_Ft and the mean of the CensusPop columns for each unique GEOID. Let's call it geoid_sum:
In [55]:geoid_sum = geoid_sum.groupby('GEOID').agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
In [56]: geoid_sum.head()
Out[56]:
GEOID Bldg_Sq_Ft CensusPop
GEOID
100010431001011 1 1154.915527 0.0
100030144041044 1 5443.207520 26.0
100050519001066 1 1164.390503 4.0
240010001001001 15 30923.517090 41.0
240010001001007 3 6651.656677 0.0
My goal is to find the GEOIDs in res_df that match the GEOIDs in geoid_sum. I want to populate the value in Pop_By_Area for that row using an equation:
Pop_By_Area = (geoid_sum['CensusPop'] * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft']
I've created a simple function that takes those parameters, but I am unsure how to iterate through the dataframes and apply the function.
def popByArea(census_pop_mean, bldg_sqft, bldg_sqft_sum):
    x = (census_pop_mean * bldg_sqft) / bldg_sqft_sum
    return x
I've tried creating a series based on the GEOID matches: s = res_df.GEOID.isin(geoid_sum.GEOID.values) but that didn't seem to work (produced all false boolean values). How can I find the matches and apply my function to populate the Pop_By_Area column?
I think you need reindex:
geoid_sum = geoid_sum.groupby('GEOID').\
    agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'}).\
    reindex(res_df['GEOID'])

res_df['Pop_By_Area'] = (geoid_sum['CensusPop'].values * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft'].values
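A hedged alternative sketch (assuming res_df still holds the raw, row-level data): groupby().transform() broadcasts the per-GEOID aggregates back onto the original rows, so no separate summary frame or reindex is needed:

# per-group mean and sum, aligned row by row with res_df
grp = res_df.groupby('GEOID')
res_df['Pop_By_Area'] = (
    grp['CensusPop'].transform('mean') * res_df['Bldg_Sq_Ft']
) / grp['Bldg_Sq_Ft'].transform('sum')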

pandas - automate graph using the combination of two columns

What is the best way to automate the graph production in the following case:
I have a data frame with different plan and type in the columns
I want a graph for each combination of plan and type
Dataframe:
plan type hour ok notok other
A cont 0 60.0 40.0 0.0
A cont 1 56.6 31.2 12.2
A vend 2 30.0 50.0 20.0
B test 5 20.0 50.0 30.0
For one df with only one plan and type, I wrote the following code:
fig_ = df.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
plt.xlabel('Hour')
plt.ylabel('(%)')
fig_.figure.savefig('p_hour.png', dpi=1000)
plt.show()
In the end, I would like to save one different figure for each combination of plan and type.
Thanks in advance!
You can try iterating over groups using groupby:
for (plan, type), group in df.groupby(['plan', 'type']):
    fig_ = group.set_index('hour').plot(kind='bar', stacked=True, colormap='YlOrBr')
    plt.xlabel('Hour')  # Maybe add plan and type
    plt.ylabel('(%)')   # Maybe add plan and type
    fig_.figure.savefig('p_hour_{}_{}.png'.format(plan, type), dpi=1000)
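A small follow-up note (an assumption on my part, since it depends on how many plan/type combinations you have): closing each figure after saving, e.g. plt.close(fig_.figure) at the end of the loop body, keeps matplotlib from accumulating open figures across iterations.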
