How to plot data based on given time? - python

I have a dataset like the one shown below.
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
I've used pandas to get the data into a DataFrame. The dataset covers multiple days, with one row per minute.
I want to plot a separate graph of the voltage against the time (shown in column 2) for each day (shown in column 1) using Python. How can I do that?

txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

f = StringIO(txt)
df = pd.read_table(f, sep=';')
plt.plot(df['Time'], df['Voltage'])
plt.show()
gives this output:

I believe this will do the trick (I edited the dates so we have two distinct days):
import pandas as pd
import matplotlib.pyplot as plt
# if you use Jupyter Notebook, also run: %matplotlib inline

df = pd.read_csv('test.csv', sep=';', usecols=['Date', 'Time', 'Voltage'])
unique_dates = df.Date.unique()
for date in unique_dates:
    print('Date: ' + date)
    df.loc[df.Date == date].plot.line('Time', 'Voltage')
    plt.show()
You will get a separate Voltage-vs-Time figure for each date.

import numpy as np

X = df.Date.unique()
for i in X:                              # iterate over the unique days
    temp_df = df[df.Date == i]           # sub-frame for that specific day
    temp_df.plot(x='Time', y='Voltage')  # plot
If you want to change the x values you can use
x = np.arange(len(temp_df))  # note: np.arange(1, len(temp_df.Time), 1) yields one value too few to pair with every row
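For example, a minimal sketch (assuming temp_df is the per-day frame produced by the loop above):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(len(temp_df))          # 0, 1, 2, ... one value per row
plt.plot(x, temp_df['Voltage'])
plt.xlabel('minutes since the first reading of the day')
plt.ylabel('Voltage')
plt.show()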

Group by hour and minute after creating a DateTime column, so the code handles multiple days; you can then filter the grouped result for a specific day.
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

f = StringIO(txt)
df = pd.read_table(f, sep=';')
# give the format explicitly so the day/month order is unambiguous
df['DateTime'] = pd.to_datetime(df['Date'] + 'T' + df['Time'] + 'Z',
                                format='%d/%m/%YT%H:%M:%SZ')
df.set_index('DateTime', inplace=True)
day = df[df['Date'] == '16/12/2006']   # filter for one specific day
grouped = day.groupby([day.index.hour, day.index.minute])['Voltage'].mean()
grouped.plot()
plt.show()

Related

Unable to create a Plot using the pivoted data set : key error

I want to create a plot chart with forecasted figures for the next 2 months. Below is the code I wrote.
import pandas as pd
from datetime import datetime
df= pd.read_csv(r'C:\Users\Desktop\Customers.csv')
parsed = pd.to_datetime(df["Date"], errors="coerce").fillna(pd.to_datetime(df["Date"],format="%Y-%d-%m",errors="coerce"))
ordinal = pd.to_numeric(df["Date"], errors="coerce").apply(lambda x: pd.Timestamp("1899-12-30")+pd.Timedelta(x, unit="D"))
df["Date"] = parsed.fillna(ordinal)
df['Amount currency'] = df['Amount currency'].str.replace(r'[^0-9\.]', '', regex=True)
df['Amount'] = df['Amount'].str.replace(r'[^0-9\.]', '', regex=True)
df['Amount currency'] = pd.to_numeric(df['Amount currency'])
df['Amount'] = pd.to_numeric(df['Amount'])
#df.Date = pd.to_datetime(df.Date).dt.to_period('m')
df['Date'] = df['Date'].dt.to_period('M').dt.to_timestamp() + pd.offsets.MonthEnd()
columns = ['Date', 'Type', 'Amount']
df = df[columns]
and then the figures need to be pivoted:
df2=pd.pivot_table(df,index='Date',values = 'Amount', columns = 'Type',aggfunc='sum')
So the final output columns are:
Date
Customer Credit Note
Payment
Sales Invoice
Based on the above code, I wanted to create a plot with 2 months of forecast:
import matplotlib.pyplot as plt
import seaborn as sns
sns.lineplot(x='Date',y='Payment',data=dataset)
plt.title("Monthly_cases")
plt.xlabel("Month end date")
plt.ylabel("Payment")
plt.show()
But the above code fails with KeyError: 'Date'.
What would be the reason for this? Can anyone help me? Also, can anyone help me modify the above code to get the next two months' forecasted values?
Thanks
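A likely cause, as a hedged note rather than a confirmed answer: pivot_table(index='Date', ...) moves Date into the index of df2, so Date is no longer a column and seaborn cannot look it up (note also that the snippet passes data=dataset instead of the pivoted df2). Resetting the index restores Date as a column:

import matplotlib.pyplot as plt
import seaborn as sns

df2 = df2.reset_index()          # 'Date' becomes a regular column again
sns.lineplot(x='Date', y='Payment', data=df2)
plt.title("Monthly_cases")
plt.xlabel("Month end date")
plt.ylabel("Payment")
plt.show()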

Python pandas rolling computations with custom step size

I have a pandas dataframe with daily data. On the last day of each month, I would like to compute a quantity that depends on the daily data of the previous n months (e.g., n=3).
My current solution is to use the pandas rolling function to compute this quantity for every day and then keep only the quantities for the last day of each month (discarding all the others). This, however, means I perform a lot of unnecessary computations.
Does anybody know how I can improve on that?
Thanks a lot in advance!
EDIT:
In the following, I add two examples. In both cases, I compute rolling regressions of stock returns. The first (short) example illustrates the problem described above and is a sub-problem of my actual problem. The second (long) example shows my actual problem. I would therefore need either a solution to the first example that can be embedded in my algorithm for the second, or a completely different solution to the second. Note: the DataFrame I'm using is very large, so keeping multiple copies of it is not feasible.
Example 1:
import numpy as np
import pandas as pd
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
df = pd.DataFrame(index=dates, columns=['Y', 'X']).sort_index()

# Generate data
X = np.array(range(0, 365))
df['X'] = X
df['Y'] = 3.1 * X - 2.5
df = df.iloc[random.sample(range(365), 280)]        # some days are missing
df.iloc[random.sample(range(280), 20), 0] = np.nan  # some observations are missing
df = df.sort_index()

# Compute beta
def estimate_beta(ser):
    return sm.OLS(df.loc[ser.index, 'Y'],
                  sm.add_constant(df.loc[ser.index, 'X']),
                  missing='drop').fit().params[-1]

# use the last 60 days and require at least 10 observations
df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)

# Get the last entry per month
df_monthly = df[['beta']].groupby([pd.Grouper(freq='M', level='date')]).agg('last')
df_monthly
Example 2:
import numpy as np
import pandas as pd
from pandas import IndexSlice as idx
import random
import statsmodels.api as sm

# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
arrays = [dates.tolist() + dates.tolist(), ["10000"] * 365 + ["10001"] * 365]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df = pd.DataFrame(index=index, columns=['Y', 'X']).sort_index()

# Generate data
df.loc[idx[:, "10000"], 'X'] = X = np.array(range(0, 365)).astype(float)
df.loc[idx[:, "10000"], 'Y'] = 3 * X - 2
df.loc[idx[:, "10001"], 'X'] = X
df.loc[idx[:, "10001"], 'Y'] = -X + 1
df = df.iloc[random.sample(range(365 * 2), 360 * 2)]        # some days are missing
df.iloc[random.sample(range(280 * 2), 20 * 2), 0] = np.nan  # some observations are missing

# Estimate beta
def estimate_beta_grouped(df_in):
    def estimate_beta(ser):
        return sm.OLS(df.loc[ser.index, 'Y'].astype(float),
                      sm.add_constant(df.loc[ser.index, 'X'].astype(float)),
                      missing='drop').fit().params[-1]
    df = df_in.droplevel('Stock').reset_index().set_index(['Date']).sort_index()
    df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)
    return df[['beta']]

df_beta = df.groupby(level='Stock').apply(estimate_beta_grouped)

# Extract beta at the last day per month
df_monthly = df.groupby([pd.Grouper(freq='M', level='Date'),
                         df.index.get_level_values(1)]).agg('last')  # last observation per month
df_monthly = df_monthly.merge(df_beta, left_index=True, right_index=True,
                              how='left')  # merge beta onto df_monthly
df_monthly
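One way to avoid the unnecessary per-day regressions, as a sketch (the helper name beta_at_month_ends is mine, and note that the label slice below is closed on both ends, unlike rolling('60D'), which excludes the left edge): loop over the month-end dates only and fit each 60-day window directly.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def beta_at_month_ends(daily, window_days=60, min_obs=10):
    # daily: DataFrame with a DatetimeIndex and columns 'Y' and 'X'
    month_ends = daily.groupby(pd.Grouper(freq='M')).tail(1).index  # last observed day per month
    betas = {}
    for end in month_ends:
        win = daily.loc[end - pd.Timedelta(days=window_days):end].dropna()
        if len(win) < min_obs:
            betas[end] = np.nan
            continue
        res = sm.OLS(win['Y'].astype(float),
                     sm.add_constant(win['X'].astype(float))).fit()
        betas[end] = res.params['X']
    return pd.Series(betas, name='beta')

# usage with Example 1's frame:
# df_monthly = beta_at_month_ends(df[['Y', 'X']]).to_frame()

For Example 2, the same helper can be applied per stock after dropping the Stock level from the index, e.g. inside a groupby(level='Stock').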

How to select top n columns from time series data instead of using nlargest in pandas?

I have weekly trade-export time-series data that I need to visualize as a stacked bar plot of trade activity. To do so, I aggregated the data by summing each column over all rows, then used nlargest() to select the top n columns. This may not be accurate, though: I make a stacked plot per year in a loop, and the top n columns can differ from year to year, whereas my sum is taken over all rows (i.e., all years) before selecting the top n, which is biased. Perhaps I should group the time-series data by year and then make each stacked plot. Is there a way to select the top n columns per year instead of using one global nlargest()?
my current attempt:
This is my current attempt at manipulating the time-series data, where I aggregate each column over all rows and then select the top n columns using nlargest():
import pandas as pd
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/a6048e547b5a963c7af356c964d15af6/raw/c57c7915cf14f81edc9d5eadaf14efbd43d3e58a/trade_df.csv'
df_ = pd.read_csv(url, parse_dates=['weekly'])
df_.set_index('weekly', inplace=True)
df_.loc['Total',:]= df_.sum(axis=0)
df1 = df_.T
df1 =df1.nlargest(6, columns=['Total'])
df1.drop('Total', axis=1, inplace=True)
df2 = df1.T
df2.reset_index(inplace=True)
df2['weekly'] = pd.to_datetime(df2['weekly'])
df2['year'] = df2['weekly'].dt.year
df2['week'] = df2['weekly'].dt.strftime('%W').astype('int')
Then I visualize the plotting data with matplotlib as follows:
import matplotlib.pyplot as plt

plt_df = df2.set_index(['year', 'week'])
plt_df.drop("weekly", axis=1, inplace=True)
for n, g in plt_df.groupby(level=0):
    ax = g.loc[n].plot.bar(stacked=True, title=f'{n} Year', figsize=(8, 5))
plt.show()
Although the stacked plots from the current approach look fine, selecting the top n columns with nlargest() over all years is not accurate. For example, in the 2019 USDA report China wasn't a top trade partner of the US, but in late 2020 China was getting more products from the US; with a global nlargest() selection of the top columns (trade partners), this becomes problematic, and China won't make the list or the plot.
Update:
As @Vaishali suggested in the comments on this post, using head() might be a good way to extract the top columns, so I tried this:
for n, g in plt_df.groupby(level=0):
    for i in g:
        gg = g[i].sort_values(g[i].values, ascending=False).groupby('week').head(5)
        ax = gg.loc[n].plot.bar(stacked=True, title=f'{n} Year', figsize=(8, 5))
but this is not working. Can anyone show me how to select the top n columns per year from time-series data?
You can try something like this:
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://gist.githubusercontent.com/adamFlyn/a6048e547b5a963c7af356c964d15af6/raw/c57c7915cf14f81edc9d5eadaf14efbd43d3e58a/trade_df.csv'
df_ = pd.read_csv(url, parse_dates=['weekly'])
df_.set_index('weekly', inplace=True)
for g, n in df_.groupby(df_.index.year):
    # keep the columns whose yearly sum ranks in the top 4
    ng = n.loc[:, n.sum().rank(ascending=False, method='min') < 5]
    ng.div(ng.sum(axis=1), axis=0).plot.area(title=f'{g}')
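Note: rank(ascending=False, method='min') < 5 keeps the columns whose yearly sum ranks 1-4 (with method='min', ties share the lower rank, so occasionally more than four columns pass); adjust the threshold for a different n.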
Output:
Bar chart:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

url = 'https://gist.githubusercontent.com/adamFlyn/a6048e547b5a963c7af356c964d15af6/raw/c57c7915cf14f81edc9d5eadaf14efbd43d3e58a/trade_df.csv'
df_ = pd.read_csv(url, parse_dates=['weekly'])
df_.set_index('weekly', inplace=True)
for g, n in df_.groupby(df_.index.year):
    ng = n.loc[:, n.sum().rank(ascending=False, method='min') < 5]
    ng.index = ng.index.strftime('%m/%d/%Y')
    ax = ng.plot.bar(stacked=True, figsize=(10, 8))
Output:
Stacked 100% bar chart:
# (same setup as the previous code)
ax = ng.div(ng.sum(axis=1), axis=0).plot.bar(stacked=True, figsize=(10, 8))
Output:
I am not sure I understand the requirement correctly here, but this is based on your output charts:
1. find the top n countries using sum and nlargest
2. filter df by top_countries, group by year and week, sum
3. for each unique year, plot a stacked chart
df.columns = df.columns.str.strip()
top_countries = df.iloc[:, 1:].sum().nlargest(6).index.tolist()
df['weekly'] = pd.to_datetime(df['weekly'])
# dt.week was removed in recent pandas; dt.isocalendar().week is the replacement
agg = df[top_countries].groupby([df['weekly'].dt.year.rename('year'),
                                 df['weekly'].dt.isocalendar().week.rename('week')]).sum()
for year in df['weekly'].dt.year.unique():
    agg[agg.index.get_level_values(0) == year].droplevel(level=0).plot.bar(
        stacked=True, figsize=(10, 5), title=year)
Edit:
If you want to pick the top countries per year, move the filtering of df inside the loop so the sums are taken over that year's rows only:
df.columns = df.columns.str.strip()
df['weekly'] = pd.to_datetime(df['weekly'])
for year in df['weekly'].dt.year.unique():
    year_df = df[df['weekly'].dt.year == year]   # restrict to this year before ranking
    top_countries = year_df.iloc[:, 1:].sum().nlargest(6).index.tolist()
    agg = year_df[top_countries].groupby(
        year_df['weekly'].dt.isocalendar().week.rename('week')).sum()
    agg.plot.bar(stacked=True, figsize=(10, 5), title=year)
You can try this:
import pandas as pd
import matplotlib.pyplot as plt

# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/a6048e547b5a963c7af356c964d15af6/raw/c57c7915cf14f81edc9d5eadaf14efbd43d3e58a/trade_df.csv'
df = pd.read_csv(url, parse_dates=['weekly'])
df['year'] = df['weekly'].dt.year
df['week'] = df['weekly'].dt.strftime('%W').astype('int')
df.set_index(['year', 'week'], inplace=True)
df.drop('weekly', axis=1, inplace=True)
df_year_sums = df.groupby(level='year').sum().T
for year in df_year_sums.columns:
    largest = list(df_year_sums[year].nlargest(6).index)
    df_plot = df.xs(year, level='year')[largest]
    df_plot.plot.bar(stacked=True, title=f'{year} Year', figsize=(8, 5))
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('trade_df.csv', parse_dates=['weekly'])
df['year'] = df['weekly'].dt.year                         # year/week columns used below
df['week'] = df['weekly'].dt.strftime('%W').astype('int')
df['Total'] = 0.0
for key, row in df.iterrows():
    row_sum = 0.0                                         # don't shadow the builtin sum()
    for row_value in row:
        if type(row_value) == float:
            row_sum += row_value
    df.loc[key, 'Total'] = row_sum
results = df.sort_values(by="Total", ascending=False)
print(results.head(5))
mask = df['year'].isin([2018])                            # 'filter' shadows a builtin, so use 'mask'
results_2018 = df[mask].sort_values(by=['Total'], ascending=False).head(5)
mask = df['year'].isin([2019])
results_2019 = df[mask].sort_values(by=['Total'], ascending=False).head(5)
mask = df['year'].isin([2020])
results_2020 = df[mask].sort_values(by=['Total'], ascending=False).head(5)
grouped = df.drop(columns=['weekly']).groupby('year').sum().T.plot.bar(stacked=True)
plt.show()
fp = results_2018.pivot_table(index=['week'], aggfunc='sum').fillna(0)
fp = fp[(fp.T != 0).any()]
fp2 = results_2019.pivot_table(index=['week'], aggfunc='sum').fillna(0)
fp2 = fp2[(fp2.T != 0).any()]
fp3 = results_2020.pivot_table(index=['week'], aggfunc='sum').fillna(0)
fp3 = fp3[(fp3.T != 0).any()]
fig, ax = plt.subplots(3, 1, figsize=(16, 16))
fp.plot.bar(stacked=True, ax=ax[0])
fp2.plot.bar(stacked=True, ax=ax[1])
fp3.plot.bar(stacked=True, ax=ax[2])
plt.show()
import numpy as np

df = pd.DataFrame(np.random.randint(1, 100, 100), columns=["column1"])
results = np.array(df.sort_values(by="column1", ascending=False)).flatten()
print(results[:5])

matplotlib from time series data frame

Say I have a data frame like this:
from pandas import DataFrame

example = {'year_month': [201801, 201802, 201803, 201801, 201802, 201803],
           'store_id': [101, 101, 101, 102, 102, 102],
           'tot_employees': [100, 200, 150, 6, 7, 10],
           'hrs_per_employee': [30, 35, 20, 20, 18, 15]}
df = DataFrame(example, columns=["year_month", "store_id", "tot_employees", "hrs_per_employee"])
df
and I want stacked subplots, one subplot per store_id, with:
x axis: year_month
line 1: tot_employees
line 2: hrs_per_employee
Is this possible with df.plot()? I haven't been able to find the correct x, y inputs to get the result I'm looking for. If not, is there a close alternative? Thanks in advance.
import pandas as pd

df = pd.DataFrame(example)
df.year_month = pd.to_datetime(df.year_month, format='%Y%m', exact=True)
df.set_index('year_month', drop=True, inplace=True)
for x in df.store_id.unique():
    df[['tot_employees', 'hrs_per_employee']][df.store_id == x].plot(title=f'Store ID: {x}')
Using only df.groupby:
df.groupby('store_id').plot(y=['tot_employees', 'hrs_per_employee'])
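Both snippets above draw one figure per store. If you want the subplots stacked in a single figure, here is a minimal sketch with plt.subplots (assuming df is the indexed frame built above):

import matplotlib.pyplot as plt

stores = df.store_id.unique()
fig, axes = plt.subplots(len(stores), 1, sharex=True, figsize=(8, 3 * len(stores)))
for ax, store in zip(axes, stores):
    sub = df[df.store_id == store]
    sub[['tot_employees', 'hrs_per_employee']].plot(ax=ax, title=f'Store ID: {store}')
plt.tight_layout()
plt.show()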

Merge Data Frames By Date With Unequal Dates

My process is this:
1. Import a csv of data containing dates, activations, and cancellations
2. Subset the data by activated or cancelled
3. Pivot the data with aggfunc 'sum'
4. Convert back to data frames
Now I need to merge the 2 data frames together, but there are dates that exist in one data frame and not in the other. Both data frames start Jan 1, 2017 and end Dec 31, 2017. Preferably, any index_month missing from one frame should get a corresponding value of 0 in the output.
Here's the .head() from both data frames:
For reference, here's the code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
%matplotlib inline
#import data
directory1 = r"C:\python\Contracts"  # raw string so the backslashes are taken literally
directory_source = os.path.join(directory1, "Contract_Data.csv")
df_source = pd.read_csv(directory_source)
#format date ranges as times
#df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
#df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
#subset the data based on status
df_active = df_source[df_source["Order Status"]=="Active"]
df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
#remove activations outside 2017 and cancellations outside 2017
df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') &
(df_cancelled['Cancellation_Day'] <= '2017-12-31')]
df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') &
(df_active['Activation_Day'] <= '2017-12-31')]
#pivot the data to aggregate by day
df_active_aggregated = df_active.pivot_table(index='Activation_Day',
values='Event_Value',
aggfunc='sum')
df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
values='Event_Value',
aggfunc='sum')
#convert pivot tables back to useable dataframes
activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
#rename the time columns so they can be referenced when merging into one DF
activations_aggregated.columns = ["index_month", "Activations"]
#activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
cancellations_aggregated.columns = ["index_month", "Cancellations"]
#cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
I'm aware there are many posts addressing issues similar to this, but I haven't been able to find anything that helped. Thanks to anyone who can give me a hand with this!
You can try:
activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)
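If some dates are missing from both frames, the outer merge alone still leaves gaps; reindexing against the full 2017 calendar fills those with 0 as well (a sketch, assuming index_month holds daily datetime values):

import pandas as pd

merged = activations_aggregated.merge(cancellations_aggregated,
                                      how='outer', on='index_month').fillna(0)
all_days = pd.date_range('2017-01-01', '2017-12-31', freq='D')
merged = (merged.set_index('index_month')
                .reindex(all_days, fill_value=0)   # add any days absent from both frames
                .rename_axis('index_month')
                .reset_index())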
