missing values using pandas.rolling_mean - python

I have lots of missing values when calculating rolling_mean with:
import datetime as dt
import pandas as pd
import pandas.io.data as web

stocklist = ['MSFT', 'BELG.BR']

# read historical prices for the last 11 years
def get_px(stock, start):
    return web.get_data_yahoo(stock, start)['Adj Close']

today = dt.date.today()
start = str(dt.date(today.year - 11, today.month, today.day))
px = pd.DataFrame({n: get_px(n, start) for n in stocklist})
px.ffill()
sma200 = pd.rolling_mean(px, 200)
and got the following result:
In [14]: px
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 2270 non-null values
MSFT 2769 non-null values
dtypes: float64(2)
In [15]: sma200
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 689 non-null values
MSFT 400 non-null values
dtypes: float64(2)
Any idea why most of the sma200 rolling_mean values are missing, and how can I get the complete list?

px.ffill() returns a new DataFrame. To modify px itself, use inplace=True:
px.ffill(inplace=True)
sma200 = pd.rolling_mean(px, 200)
print(sma200)
yields
Data columns:
BELG.BR 2085 non-null values
MSFT 2635 non-null values
dtypes: float64(2)

If you print sma200, you will probably find lots of null or missing values. This is because rolling_mean requires a full window of non-null data points by default: min_periods defaults to the window size, so any 200-observation window containing a NaN yields NaN.
Try using
sma200 = pd.rolling_mean(px, 200, min_periods=2)
From the pandas docs:
min_periods: threshold of non-null data points to require (otherwise result is NA)
You could also try changing the size of the window if your dataset is missing many points.
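For reference, pd.rolling_mean was removed in later pandas releases in favor of the DataFrame.rolling() method. A minimal sketch of the combined fix on a current install, assuming the px frame from the question:

# forward-fill first, then compute the moving average with a relaxed
# non-null threshold; .rolling() replaces pd.rolling_mean in pandas 0.18+
px = px.ffill()
sma200 = px.rolling(200, min_periods=2).mean()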

Related

python pandas | replacing the date and time string with only time

 price  quantity                  high time
  10.4         3  2021-11-08 14:26:00-05:00
The dataframe is named ddg, and the dtype of the high time column is datetime64[ns, America/New_York].
I want high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
I think that's because 'high_time' is not the right column name; the column is called 'high time', with a space:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   price      1 non-null      float64
 1   quantity   1 non-null      int64
 2   high time  1 non-null      datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
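If you would rather keep the underscore name, a small sketch (assuming the ddg frame from the question) is to rename the column once, after which the original line works; '%H:%M:%S' keeps the seconds the question asked for:

# hypothetical: normalize the column name, then format the timestamps
ddg = ddg.rename(columns={'high time': 'high_time'})
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M:%S')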

Pandas resample drops column

Pandas is dropping one of my columns when resampling, and I don't understand why. I've read that this can happen if the columns don't have a proper numerical type, but that isn't the case here:
import pandas

# movements is the target data frame with daily movements
movements = pandas.DataFrame(columns=['date', 'amount', 'cash'])
movements.set_index('date', inplace=True)

# df is a movement to add
df = pandas.DataFrame({'amount': 179, 'cash': 100.00},
                      index=[pandas.Timestamp('2015/12/31')])
print(df); print(df.info()); print()

# add df to movements and resample movements
movements = movements.append(df).resample('D').sum().fillna(0)
print(movements); print(movements.info())
results in:
amount cash
2015-12-31 179 100.0
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1 entries, 2015-12-31 to 2015-12-31
Data columns (total 2 columns):
amount 1 non-null int64
cash 1 non-null float64
dtypes: float64(1), int64(1)
memory usage: 24.0 bytes
None
cash
2015-12-31 100.0
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1 entries, 2015-12-31 to 2015-12-31
Freq: D
Data columns (total 1 columns):
cash 1 non-null float64
dtypes: float64(1)
memory usage: 16.0 bytes
None
I noticed that the drop happens only when cash is a float, i.e. if in the code above cash is set to 100 (int) rather than 100.00, then all columns are int and amount isn't dropped.
Any idea?
The problem is that when you created the movements DataFrame, the dtype of its columns was set to object, because the frame was empty and pandas could not infer numeric types.
If you set the column types upfront, or convert them to numeric types later, it should work:
movements.append(df).apply(pd.to_numeric).resample('D').sum().fillna(0)
Out[100]:
amount cash
2015-12-31 179 100.0
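Alternatively, a minimal sketch of declaring the dtypes upfront, with column names taken from the question (note that DataFrame.append was removed in pandas 2.0, where pandas.concat([movements, df]) would replace it):

# give each column an explicit numeric dtype so nothing is object
movements = pandas.DataFrame({'amount': pandas.Series(dtype='int64'),
                              'cash': pandas.Series(dtype='float64')},
                             index=pandas.DatetimeIndex([], name='date'))
movements = movements.append(df).resample('D').sum().fillna(0)  # 'amount' survives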

Transform category, start_time, end_time DataFrame for plotting in pandas

I have a pandas DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32656 entries, 94418 to 2
Data columns (total 8 columns):
customer_id 32656 non-null object
session_id 32656 non-null int64
start 32656 non-null datetime64[ns, America/Los_Angeles]
end 32656 non-null datetime64[ns, America/Los_Angeles]
length 32656 non-null timedelta64[ns]
category 32656 non-null object
rounded_start 32656 non-null datetime64[ns, America/Los_Angeles]
rounded_end 32656 non-null datetime64[ns, America/Los_Angeles]
dtypes: datetime64[ns, America/Los_Angeles](4), int64(1), object(2), timedelta64[ns](1)
memory usage: 2.2+ MB
I also create a DatetimeIndex:
rng = pd.date_range(df['rounded_start'].min(), end=df['rounded_start'].max(), freq='5Min')
How do I tie the two datasets together so that I can plot each point in the range on the x-axis and show how many categories are active at that time?
I suspect this will work though I haven't verified.
df_count = pd.DataFrame(index=rng)

def count_cats(x, df):
    # x.name is the timestamp of this (empty) row of df_count
    date = x.name
    condition1 = df.start <= date
    condition2 = df.end >= date
    df_slice = df.loc[condition1 & condition2, 'category']
    return pd.Series([df_slice.unique().size], index=['CountCats'])

# axis=1 so each row (one timestamp in rng) is passed to count_cats
df_count = df_count.apply(lambda x: count_cats(x, df), axis=1)
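If the apply machinery feels opaque, an equivalent sketch (also unverified) builds the counts directly with a comprehension over the range:

# count the distinct categories active at each 5-minute tick
counts = pd.Series([df.loc[(df.start <= t) & (df.end >= t), 'category'].nunique()
                    for t in rng], index=rng, name='CountCats')
counts.plot()  # timestamps on the x-axis, category counts on the y-axis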

Is there a way to plot a pandas series in ggplot?

I'm experimenting with pandas and non-matplotlib plotting. Good suggestions are here. This question regards yhat's ggplot and I am running into two issues.
Plotting a series in pandas is easy.
frequ.plot()
I don't see how to do this in the ggplot docs. Instead I end up creating a dataframe:
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line()
I would expect ggplot -- a project that has "tight integration with pandas" -- to have a way to plot a simple series.
Second issue is I can't get stat_smooth() to display when the x axis is time of day. Seems like it could be related to this post, but I don't have the rep to post there. My code is:
frequ = values.sampler.resample("1Min", how="count")
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line() + stat_smooth()
Any help regarding non-matplotlib plotting would be appreciated. Thanks!
(I'm using ggplot 0.5.8)
I run into this problem frequently in Python's ggplot when working with multiple stock prices and economic time series. The key thing to remember with ggplot is that data is best organized in long format to avoid any issues. I use a quick two-step process as a workaround. First, let's grab some stock data:
import pandas.io.data as web
import pandas as pd
import time
from ggplot import *

stocks = ['GOOG', 'MSFT', 'LNKD', 'YHOO', 'FB', 'GOOGL', 'HPQ', 'AMZN']  # stock list

# get stock price function
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']

# dataframe of equity prices from 2014-01-01 through today
date_today = time.strftime('%m/%d/%Y')
px = pd.DataFrame({n: get_px(n, '1/1/2014', date_today) for n in stocks})
px.head()
AMZN FB GOOG GOOGL HPQ LNKD MSFT YHOO
Date
2014-01-02 397.97 54.71 NaN 557.12 27.40 207.64 36.63 39.59
2014-01-03 396.44 54.56 NaN 553.05 28.07 207.42 36.38 40.12
2014-01-06 393.63 57.20 NaN 559.22 28.02 203.92 35.61 39.93
2014-01-07 398.03 57.92 NaN 570.00 27.91 209.64 35.89 40.92
2014-01-08 401.92 58.23 NaN 571.19 27.19 209.06 35.25 41.02
First, understand that ggplot needs the datetime index to be a column in the pandas dataframe in order to plot correctly when switching from wide to long format. I wrote a function to address this particular point. It simply creates a 'Date' column of datetime type from the dataframe's index.
def dateConvert(df):
    # copy the DatetimeIndex into a regular 'Date' column; pd.melt
    # works off columns, so the index itself can stay in place
    df['Date'] = df.index
    return df
From there, run the function on the df and pass the result to pd.melt with 'Date' as the id_vars. The returned df is then ready to be plotted using the standard ggplot() format.
px_returns = px.pct_change()  # common stock transformation
cumRet = (1 + px_returns).cumprod() - 1  # transform daily returns to cumulative
cumRet_dateConverted = dateConvert(cumRet)  # run the function; see the result below
cumRet_dateConverted.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 118 entries, 2014-01-02 00:00:00 to 2014-06-20 00:00:00
Data columns (total 9 columns):
AMZN 117 non-null float64
FB 117 non-null float64
GOOG 59 non-null float64
GOOGL 117 non-null float64
HPQ 117 non-null float64
LNKD 117 non-null float64
MSFT 117 non-null float64
YHOO 117 non-null float64
Date 118 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(8)
# reshape to long format; note the use of 'Date' as the id_vars
data = pd.melt(cumRet_dateConverted, id_vars='Date').dropna()
# common to rename these columns
data = data.rename(columns={'variable': 'Stocks', 'value': 'Returns'})
From here you can now plot your data however you want. A common plot I use is the following:
key_Stock = 'AMZN'  # hypothetical: whichever ticker the title should name
retPlot_YTD = ggplot(data, aes('Date', 'Returns', color='Stocks')) \
    + geom_line(size=2.) \
    + geom_hline(yintercept=0, color='black', size=1.7, linetype='-.') \
    + scale_y_continuous(labels='percent') \
    + scale_x_date(labels='%b %d %y', breaks=date_breaks('week')) \
    + theme_seaborn(style='whitegrid') \
    + ggtitle('%s Cumulative Daily Return vs Peers_YTD' % key_Stock)
# drop to the underlying matplotlib figure to reposition the legend
fig = retPlot_YTD.draw()
ax = fig.axes[0]
offbox = ax.artists[0]
offbox.set_bbox_to_anchor((1, 0.5), ax.transAxes)
fig.show()
This is more of a workaround but you can use qplot for quick, shorthand plots using series.
from ggplot import *
qplot(meat.beef)

Plotting with GroupBy in Pandas/Python

Although it is straightforward and easy to plot groupby objects in pandas, I am wondering what the most pythonic (pandastic?) way to grab the unique groups from a groupby object is. For example:
I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days worth of data where the timestamp is the index:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date 10909 non-null values
Flags 10909 non-null values
Time 10909 non-null values
convt 10909 non-null values
hino 10909 non-null values
hinox 10909 non-null values
intt 10909 non-null values
no 10909 non-null values
nox 10909 non-null values
ozonf 10909 non-null values
pmtt 10909 non-null values
pmtv 10909 non-null values
pres 10909 non-null values
rctt 10909 non-null values
smplf 10909 non-null values
stamp 10909 non-null values
no2 10909 non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)
To be able to average the data (and take other statistics) at every minute across several days, I group the dataframe:
data = no.groupby('Time')
I can then easily plot the mean NO concentration as well as quartiles:
ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
The issue that fuels my question is that in order to plot more interesting things with functions like fill_between(), it is necessary to know the x-axis values, per the documentation:
fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)
For the life of me, I cannot figure out the best way to accomplish this. I have tried:
Iterating over the groupby object and creating an array of the groups
Grabbing all of the unique Time entries from the original DataFrame
I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
UPDATES:
The statistics can be dumped into a new dataframe using unstack(), such as:
no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count 1440 non-null values
mean 1440 non-null values
std 1440 non-null values
min 1440 non-null values
25% 1440 non-null values
50% 1440 non-null values
75% 1440 non-null values
max 1440 non-null values
dtypes: float64(8)
Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.
Current plot code and TypeError:
ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
TypeError:
TypeError Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
2 ax = figure(figsize=(12,8)).add_subplot(111)
3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
6 #ylabel('Concentration [ppb]')
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
6986
6987 # Convert the arrays so we can work with them
-> 6988 x = ma.masked_invalid(self.convert_xunits(x))
6989 y1 = ma.masked_invalid(self.convert_yunits(y1))
6990 y2 = ma.masked_invalid(self.convert_yunits(y2))
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
2237 cls = type(a)
2238 else:
-> 2239 condition = ~(np.isfinite(a))
2240 cls = MaskedArray
2241 result = a.view(cls)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The plot as of now looks like this: [plot image not included]
Storing the groupby stats (mean/25/75) as columns in a new dataframe and then passing the new dataframe's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1). e.g.,
gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)
gdf.info() should look like this:
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count 12 non-null float64
mean 12 non-null float64
std 12 non-null float64
min 12 non-null float64
25% 12 non-null float64
50% 12 non-null float64
75% 12 non-null float64
max 12 non-null float64
dtypes: float64(8)
Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:
df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
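Putting the two pieces together, a minimal unverified sketch of the whole flow on the question's no frame (pd.datetools was removed in later pandas, so this calls dateutil's parser directly; describe().unstack() matches the question's pandas, where groupby describe returned a stacked Series):

import pandas as pd
import matplotlib.pyplot as plt
from dateutil.parser import parse

# convert "HH:MM" strings to datetime.time so matplotlib can place them
no['Time'] = no['Time'].map(lambda x: parse(x).time())
gdf = no.groupby('Time')['no'].describe().unstack()

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(gdf.index, gdf['mean'], 'b', label='Mean')
ax.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5, facecolor='green')
ax.set_ylabel('Concentration [ppb]')
ax.legend()
plt.show()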
