I'm experimenting with pandas and non-matplotlib plotting; there are good suggestions here. This question concerns yhat's ggplot, and I am running into two issues.
Plotting a series in pandas is easy.
frequ.plot()
I don't see how to do this in the ggplot docs. Instead I end up creating a dataframe:
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line()
I would expect ggplot -- a project that has "tight integration with pandas" -- to have a way to plot a simple series.
The second issue is that I can't get stat_smooth() to display when the x-axis is time of day. It seems like it could be related to this post, but I don't have the rep to post there. My code is:
frequ = values.sampler.resample("1Min", how="count")
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line() + stat_smooth()
Any help regarding non-matplotlib plotting would be appreciated. Thanks!
(I'm using ggplot 0.5.8)
I run into this problem frequently in Python's ggplot when working with multiple stock prices and economic time series. The key thing to remember with ggplot is that data is best organized in long format to avoid issues. I use a quick two-step process as a workaround. First, let's grab some stock data:
import pandas.io.data as web
import pandas as pd
import time
from ggplot import *
stocks = ['GOOG', 'MSFT', 'LNKD', 'YHOO', 'FB', 'GOOGL', 'HPQ', 'AMZN']  # stock list

# get stock price function
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']

# dataframe of equity prices
date_today = time.strftime('%m/%d/%Y')  # assumed definition; 'date_today' is used below but never defined
px = pd.DataFrame({n: get_px(n, '1/1/2014', date_today) for n in stocks})
px.head()
AMZN FB GOOG GOOGL HPQ LNKD MSFT YHOO
Date
2014-01-02 397.97 54.71 NaN 557.12 27.40 207.64 36.63 39.59
2014-01-03 396.44 54.56 NaN 553.05 28.07 207.42 36.38 40.12
2014-01-06 393.63 57.20 NaN 559.22 28.02 203.92 35.61 39.93
2014-01-07 398.03 57.92 NaN 570.00 27.91 209.64 35.89 40.92
2014-01-08 401.92 58.23 NaN 571.19 27.19 209.06 35.25 41.02
First, understand that ggplot needs the datetime index to be a column in the pandas DataFrame in order to plot correctly when switching from wide to long format. I wrote a function to address this particular point; it simply creates a 'Date' column of datetime type from the DataFrame's index.
def dateConvert(df):
    df['Date'] = df.index                    # copy the DatetimeIndex into a 'Date' column
    df.reset_index(drop=True, inplace=True)  # replace the old index with a plain RangeIndex
    return df
From there, run the function on the df and pass the result to pd.melt, using 'Date' as the id_vars. The returned df is then ready to be plotted using the standard ggplot() format.
px_returns = px.pct_change()                # common stock transformation: daily returns
cumRet = (1 + px_returns).cumprod() - 1     # transform daily returns to cumulative returns
cumRet_dateConverted = dateConvert(cumRet)  # run the function; see the result below
cumRet_dateConverted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 9 columns):
AMZN 117 non-null float64
FB 117 non-null float64
GOOG 59 non-null float64
GOOGL 117 non-null float64
HPQ 117 non-null float64
LNKD 117 non-null float64
MSFT 117 non-null float64
YHOO 117 non-null float64
Date 118 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(8)
data = pd.melt(cumRet_dateConverted, id_vars='Date').dropna()  # reshape to long format, with 'Date' as the id_vars
data = data.rename(columns={'variable': 'Stocks', 'value': 'Returns'})  # common to rename these columns
From here you can now plot your data however you want. A common plot I use is the following:
# 'key_Stock' is assumed to hold the ticker being highlighted (defined elsewhere)
retPlot_YTD = ggplot(data, aes('Date', 'Returns', color='Stocks')) \
+ geom_line(size=2.) \
+ geom_hline(yintercept=0, color='black', size=1.7, linetype='-.') \
+ scale_y_continuous(labels='percent') \
+ scale_x_date(labels='%b %d %y', breaks=date_breaks('week')) \
+ theme_seaborn(style='whitegrid') \
+ ggtitle('%s Cumulative Daily Return vs Peers_YTD' % key_Stock)
fig = retPlot_YTD.draw()  # render the ggplot object to a matplotlib figure
ax = fig.axes[0]
offbox = ax.artists[0]    # the legend is stored as the first artist
offbox.set_bbox_to_anchor((1, 0.5), ax.transAxes)  # move the legend outside the plot area
fig.show()
This is more of a workaround, but you can use qplot for quick, shorthand plots of a series:
from ggplot import *
qplot(meat.beef)
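Applied to the resampled series from the first question, that is simply:
qplot(frequ)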
I have to create a function that gives me the count of days between two dates, excluding both weekends and the holidays stored in a dataframe.
My holidays df looks like this:
Data
0 2001-01-01
1 2001-02-26
2 2001-02-27
3 2001-04-13
4 2001-04-21
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Data    936 non-null    datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 7.4 KB
So it should look like:
def delta_days(date_initial, date_end, holidays):
.....
What would be the best way?
Here you go:
import pandas as pd
from datetime import datetime

def delta_days(date_initial, date_end, holidays):
    # parse the ISO-formatted date strings
    date_initial = datetime.strptime(date_initial, '%Y-%m-%d')
    date_end = datetime.strptime(date_end, '%Y-%m-%d')
    # custom business-day range: skips weekends and the given holidays
    work_days = pd.bdate_range(start=date_initial, end=date_end, holidays=holidays, freq='C')
    return len(work_days)
Testing the code:
holidays = ['2021-01-01','2021-04-04','2021-04-21','2021-05-01','2021-09-07','2021-10-12','2021-11-02','2021-11-15','2021-12-25']
delta_days('2021-01-01','2021-12-31',holidays=holidays)
Output:
255
Now, you can go one step further and automate the construction of the holidays list:
from workalendar.america import Brazil

cal = Brazil()
# cal.holidays(2021) returns (date, label) tuples; keep only the dates
datetime_feriados = pd.to_datetime([d[0] for d in cal.holidays(2021)])
lista_feriados = [d.strftime('%Y-%m-%d') for d in datetime_feriados]
lista_feriados
Output:
['2021-01-01','2021-04-04','2021-04-21','2021-05-01','2021-09-07','2021-10-12','2021-11-02','2021-11-15','2021-12-25']
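The generated list can then be fed straight back into the function above:
delta_days('2021-01-01', '2021-12-31', holidays=lista_feriados)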
I have a data frame with a DateTime column that I am trying to use for my plot visualization, but I am thrown an error when setting the column as my x-axis. I noticed in the documentation that lineplot accepts a number, not a datetime, but that seems strange to me, as I have seen seaborn graphs with dates on the x-axis. Is there something I am doing wrong?
test_data = pd.read_csv('organic-sessions.csv', header=0, index_col='date', parse_dates=['date'])
test_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 730 entries, 2018-01-01 to 2019-12-31
Data columns (total 1 columns):
organic 274 non-null float64
dtypes: float64(1)
memory usage: 11.4 KB
sns.lineplot(x='date', y="organic", legend='full', data=test_data)
ValueError: Could not interpret input 'date'
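A likely cause, sketched below (this fix is an assumption, not part of the original post): index_col='date' turns 'date' into the index, so seaborn cannot find it among the columns. Either pass the index directly or move it back into a column:
import pandas as pd
import seaborn as sns

test_data = pd.read_csv('organic-sessions.csv', header=0,
                        index_col='date', parse_dates=['date'])

# option 1: pass the index itself as the x vector
sns.lineplot(x=test_data.index, y=test_data['organic'], legend='full')

# option 2: turn the index back into a 'date' column
sns.lineplot(x='date', y='organic', data=test_data.reset_index(), legend='full')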
I'm trying to implement time-series forecasting in Python using pandas and statsmodels. I want to read data from a local Excel file and then decompose it into trend and seasonal components, but I'm unable to find any relevant links that show how to load data from a local drive.
I tried using the following code:
import pandas
import statsmodels.api as sm

excel = pandas.ExcelFile('PET_PRI_SPT_S1_D.xls')
df = excel.parse(excel.sheet_names[1])
dta = sm.datasets.co2.load_pandas().data
dta.co2.interpolate(inplace=True)
res = sm.tsa.seasonal_decompose(dta.co2)
resplot = res.plot()
res.resid
But if I try printing the dta variable, it shows some other data, not PET_PRI_SPT_S1_D.xls. Even resplot = res.plot() doesn't seem to work; no plot shows up.
Can you please guide me on how to load data from a local drive into a pandas DataFrame?
Edit 1:
I get the following when I try df.info(). Here df is the object holding my Excel file.
DatetimeIndex: 7509 entries, 1986-01-24 00:00:00 to 2015-06-08 00:00:00
Data columns (total 2 columns):
WTI      7408 non-null object
Brent    7115 non-null object
dtypes: object(2)
memory usage: 117.3+ KB
When I try dta.info(), where dta is the sm.datasets.co2 object, I get the following:
DatetimeIndex: 2284 entries, 1958-03-29 00:00:00 to 2001-12-29 00:00:00
Freq: W-SAT
Data columns (total 1 columns):
co2    2284 non-null float64
dtypes: float64(1)
memory usage: 35.7 KB
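A sketch of one way to load and decompose the local file, assuming the 'WTI' column from the df.info() output above (the weekly resampling is an assumption, made only to give seasonal_decompose the regular frequency it needs):
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# load the local spreadsheet; sheet_names[1] matches the question's code
excel = pd.ExcelFile('PET_PRI_SPT_S1_D.xls')
df = excel.parse(excel.sheet_names[1], index_col=0)

# the columns are dtype object (see df.info() above), so coerce to numeric
wti = pd.to_numeric(df['WTI'], errors='coerce')

# resample the irregular daily quotes to a regular weekly series and fill gaps
wti = wti.resample('W').mean().interpolate()

res = sm.tsa.seasonal_decompose(wti)
res.plot()
plt.show()  # without this, no plot window appears in a plain script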
I have lots of missing values when calculating rolling_mean with:
import datetime as dt
import pandas as pd
import pandas.io.data as web
stocklist = ['MSFT', 'BELG.BR']
# read historical prices for last 11 years
def get_px(stock, start):
    return web.get_data_yahoo(stock, start)['Adj Close']
today = dt.date.today()
start = str(dt.date(today.year-11, today.month, today.day))
px = pd.DataFrame({n: get_px(n, start) for n in stocklist})
px.ffill()
sma200 = pd.rolling_mean(px, 200)
I got the following result:
In [14]: px
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 2270 non-null values
MSFT 2769 non-null values
dtypes: float64(2)
In [15]: sma200
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 689 non-null values
MSFT 400 non-null values
dtypes: float64(2)
Any idea why most of the sma200 rolling_mean values are missing, and how to get the complete list?
px.ffill() returns a new DataFrame. To modify px itself, use inplace=True:
px.ffill(inplace=True)
sma200 = pd.rolling_mean(px, 200)
print(sma200)
yields
Data columns:
BELG.BR 2085 non-null values
MSFT 2635 non-null values
dtypes: float64(2)
If you print sma200, you will probably find lots of null or missing values. This is because the threshold for number of non-nulls is high by default for rolling_mean.
Try using
sma200 = pd.rolling_mean(px, 200, min_periods=2)
From the pandas docs:
min_periods: threshold of non-null data points to require (otherwise result is NA)
You could also try changing the size of the window if your dataset is missing many points.
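Note that in modern pandas, pd.rolling_mean has been deprecated and removed; the equivalent spelling uses the .rolling() method:
sma200 = px.rolling(window=200, min_periods=2).mean()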
I have been working for quite some time with Python and pandas to analyse a set of hourly data, and I find it quite nice (coming from Matlab).
Now I am kind of stuck. I created my DataFrame like this:
SamplingRateMinutes=60
index = DateRange(initialTime,finalTime, offset=datetools.Minute(SamplingRateMinutes))
ts=DataFrame(data, index=index)
What I want to do now is select the data for all days at the hours 10 to 13 and 20 to 23, to use it for further calculations.
So far I sliced the data using
selectedData=ts[begin:end]
And I am sure I could come up with some kind of dirty loop to select the data needed. But there must be a more elegant way to index exactly what I want. I am sure this is a common problem, and the solution in pseudocode should look somewhat like this:
myIndex=ts.index[10<=ts.index.hour<=13 or 20<=ts.index.hour<=23]
selectedData=ts[myIndex]
To mention, I am an engineer and not a programmer :) ... yet
In upcoming pandas 0.8.0, you'll be able to write
hour = ts.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))
data = ts[selector]
Here's an example that does what you want:
In [32]: from datetime import datetime as dt
In [33]: dr = p.DateRange(dt(2009,1,1),dt(2010,12,31), offset=p.datetools.Hour())
In [34]: hr = dr.map(lambda x: x.hour)
In [35]: dt = p.DataFrame(rand(len(dr),2), dr)
In [36]: dt
Out[36]:
<class 'pandas.core.frame.DataFrame'>
DateRange: 17497 entries, 2009-01-01 00:00:00 to 2010-12-31 00:00:00
offset: <1 Hour>
Data columns:
0 17497 non-null values
1 17497 non-null values
dtypes: float64(2)
In [37]: dt[(hr >= 10) & (hr <=16)]
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Index: 5103 entries, 2009-01-01 10:00:00 to 2010-12-30 16:00:00
Data columns:
0 5103 non-null values
1 5103 non-null values
dtypes: float64(2)
As it looks messy in my comment above, I decided to provide another answer, which is a syntax update for pandas 0.10.0 on Marc's answer, combined with Wes' hint:
import pandas as pd
from datetime import datetime
from numpy.random import rand

dr = pd.date_range(datetime(2009, 1, 1), datetime(2010, 12, 31), freq='H')
dt = pd.DataFrame(rand(len(dr), 2), dr)
hour = dt.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))
data = dt[selector]
Pandas DataFrame has a built-in function
pandas.DataFrame.between_time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2),
                  index=pd.date_range(start='2017-01-01', freq='10min', periods=1000))
Create two data frames, one for each period of time:
df1 = df.between_time(start_time='10:00', end_time='13:00')
df2 = df.between_time(start_time='20:00', end_time='23:00')
The data frame you want is the merged and sorted combination of df1 and df2:
pd.concat([df1, df2], axis=0).sort_index()
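Note that between_time operates on the index; if the timestamps live in an ordinary column instead, set that column as the index first (a one-line sketch, where 'time' is a hypothetical column name):
df = df.set_index('time')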