Seaborn Not Accepting Datetime value - python

I have a data frame with a DateTime column that I am trying to use for my plot, but I get an error when setting that column as my x-axis. I noticed in the documentation that lineplot accepts a number, not a datetime, but that seems strange to me since I have seen seaborn graphs with dates on the x-axis. Is there something I am doing wrong?
test_data = pd.read_csv('organic-sessions.csv', header=0, index_col='date', parse_dates=['date'])
test_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 730 entries, 2018-01-01 to 2019-12-31
Data columns (total 1 columns):
organic 274 non-null float64
dtypes: float64(1)
memory usage: 11.4 KB
sns.lineplot(x='date', y="organic", legend='full', data=test_data)
ValueError: Could not interpret input 'date'
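One likely cause: read_csv was given index_col='date', so date is the DataFrame's index rather than a column, and seaborn cannot resolve the name 'date'. A minimal sketch (with hypothetical data standing in for the CSV) of two ways around this:

```python
import pandas as pd

# hypothetical data standing in for organic-sessions.csv
test_data = pd.DataFrame(
    {"organic": [10.0, 12.5, 11.0]},
    index=pd.DatetimeIndex(["2018-01-01", "2018-01-02", "2018-01-03"], name="date"),
)

# 'date' is the index, not a column, so seaborn cannot look it up by name.
# Moving it back into a column makes x='date' resolvable:
plot_data = test_data.reset_index()
# sns.lineplot(x='date', y='organic', data=plot_data)

# Alternatively, pass the index and column directly:
# sns.lineplot(x=test_data.index, y=test_data['organic'])
```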

How to transform a Dataframe into a Series with Darts including the DatetimeIndex?

My DataFrame, temperature measurements over time:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17545 entries, 2020-01-01 00:00:00+00:00 to 2022-01-01 00:00:00+00:00
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 T (degC) 17545 non-null float64
dtypes: float64(1)
memory usage: 274.1 KB
After transforming the dataframe into a Time Series with
df_series = TimeSeries.from_dataframe(df)
df_series
the result does not look as expected, and I can't plot the Series:
TypeError: Plotting requires coordinates to be numeric, boolean, or dates of type numpy.datetime64, datetime.datetime, cftime.datetime or pandas.Interval. Received data of type object instead.
I expected something like this from the darts doc (https://unit8co.github.io/darts/):
df
    The DataFrame.
time_col
    The time column name. If set, the column will be cast to a pandas DatetimeIndex. If not set, the DataFrame index will be used; in this case the DataFrame must contain an index that is either a pandas DatetimeIndex or a pandas RangeIndex. If a DatetimeIndex is used, it is better if it has no holes; alternatively, setting fill_missing_dates can in some cases solve these issues (filling holes with NaN, or with the provided fillna_value numeric value, if any).
Given the method description above, I don't know why it changed my DatetimeIndex to object.
Any suggestions?
Thanks.
I had the same issue. Darts doesn't work with datetime64[ns, utc], but does work with datetime64[ns]; it doesn't recognise datetime64[ns, utc] as a datetime type of value.
This fixes it by converting datetime64[ns, utc] -> datetime64[ns]:
def set_index(df):
    df['open_time'] = pd.to_datetime(df['open_time'], infer_datetime_format=True).dt.tz_localize(None)
    df.set_index(keys='open_time', inplace=True, drop=True)
    return df
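For the temperature DataFrame above, where the timestamps live in the index rather than in a column, the same fix can be applied directly to the index. A minimal sketch, assuming a UTC-aware DatetimeIndex like the one shown by df.info():

```python
import pandas as pd

# hypothetical stand-in for the temperature DataFrame
df = pd.DataFrame(
    {"T (degC)": [1.2, 1.5, 1.1]},
    index=pd.date_range("2020-01-01", periods=3, freq="D", tz="UTC"),
)

# drop the UTC info so the dtype becomes plain datetime64[ns]
df.index = df.index.tz_localize(None)
# df_series = TimeSeries.from_dataframe(df)  # should now convert cleanly
```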

pd.to_datetime() not properly converting the type to datetime

I am looking to parse data with multiple timezones on a single column. I am using the pd.to_datetime function.
df = pd.DataFrame({'timestamp':['2019-05-21 12:00:00-06:00', '2019-05-21 12:15:00-07:00']})
df['timestamp'] = pd.to_datetime(df.timestamp)
df.info()
This results in:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 2 non-null object
dtypes: object(1)
memory usage: 144.0+ bytes
I did some testing and noticed that the same does not happen when the offsets are all the same:
df = pd.DataFrame({'timestamp':['2019-05-21 12:00:00-06:00', '2019-05-21 12:15:00-06:00']})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 2 non-null datetime64[ns, pytz.FixedOffset(-360)]
dtypes: datetime64[ns, pytz.FixedOffset(-360)](1)
memory usage: 144.0 bytes
If this behaviour is confirmed, it has direct implications for the datetime accessors, and it also breaks some compatibility (or assumed compatibility) with libraries that operate on these types. pd.to_datetime() successfully converts every element to a datetime.datetime, but libraries like pyarrow will apply a fixed tz offset to the column.
Based on many questions on StackOverflow (ex: Convert pandas column with multiple timezones to single timezone) this was not the behavior of pandas in previous versions.
I am on pandas 1.2.4 (I updated from 1.2.2 that shows the same). Python 3.7.9.
Should I report this as a GitHub issue?
I'd suggest keeping the original timestamp column with the offset (so you don't lose that info) and working in UTC (utc=True). If you know the time zone that put the offset on the data, you can also tz_convert.
Ex / cleaned-up version of the linked question:
import pandas as pd
# sample data
df = pd.DataFrame({'timestamp': ['2019-05-21 12:00:00-06:00',
                                 '2019-02-21 12:15:00-07:00']})
# assuming we know the origin time zone
zone = 'America/Boise'
# skip the .dt.tz_convert(zone) part if you don't have the specific zone
df['datetime'] = pd.to_datetime(df['timestamp'], utc=True).dt.tz_convert(zone)
df
                   timestamp                  datetime
0  2019-05-21 12:00:00-06:00 2019-05-21 12:00:00-06:00
1  2019-02-21 12:15:00-07:00 2019-02-21 12:15:00-07:00
df['datetime']
0 2019-05-21 12:00:00-06:00
1 2019-02-21 12:15:00-07:00
Name: datetime, dtype: datetime64[ns, America/Boise]
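To make the dtype effect explicit: with utc=True, both offsets are normalized into a single tz-aware dtype. A small sketch (the no-utc path behaves differently across pandas versions, so it is left out):

```python
import pandas as pd

s = pd.Series(['2019-05-21 12:00:00-06:00', '2019-05-21 12:15:00-07:00'])

# utc=True converts every element into one tz-aware dtype
utc = pd.to_datetime(s, utc=True)
# utc.dtype is datetime64[ns, UTC]; 12:00-06:00 becomes 18:00 UTC
```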

How to plot a large dataframe

This is what my dataframe looks like :
Date, Sales, location
There are a total of 20k entries. Dates are from 2016-2019. I need to have dates on x axis and sales on y axis. This is how I have done it
df1.plot(x="DATE", y=["Total_Sales"], kind="bar", figsize=(1000,20))
Unfortunately, even with this the dates aren't clearly visible. How do I make sure they are plotted legibly? Is there a way to create bins or something?
Edit: output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 382 entries, 0 to 18116
Data columns (total 5 columns):
DATE 382 non-null object
Total_Sales 358 non-null float64
Total_Sum 24 non-null float64
Total_Units 382 non-null int64
locationkey 382 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 17.9+ KB
Edit: Maybe I can divide it into years stacked on top of each other. So, for instance, Jan to Dec 16 will be the first and then succeeding years will be plotted with it. How do I do that?
I recommend that you do this:
df.DATE = pd.to_datetime(df.DATE)
df = df.set_index('DATE')
Now the dataframe's index is the date. This is very convenient. For example, you can do:
df.resample('Y').sum()
You should also be able to plot:
df.Total_Sales.plot()
And pandas will take care of making the x-axis linear in time, formatting the date, etc.
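For the edit's idea of stacking years on top of each other: with a datetime index in place, one sketch (using hypothetical daily sales) is to pivot so each year becomes its own column, indexed by day of year:

```python
import pandas as pd

# hypothetical daily sales spanning 2016-2019
idx = pd.date_range("2016-01-01", "2019-12-31", freq="D")
df = pd.DataFrame({"Total_Sales": range(len(idx))}, index=idx)

# one column per year, rows aligned on day-of-year
by_year = (
    df.assign(year=df.index.year, doy=df.index.dayofyear)
      .pivot_table(index="doy", columns="year", values="Total_Sales")
)
# by_year.plot()  # draws one line per year, Jan-Dec on the x-axis
```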

Load local dataset into Python 3.4 Pandas statsmodel for Time Series

I'm trying to implement time series forecasting in Python using Pandas and statsmodels. I want to read data from a local Excel file and decompose it into trend and seasonal components, but I can't find any relevant links that show how to load data from a local drive.
I tried using the following code:
excel = pandas.ExcelFile( 'PET_PRI_SPT_S1_D.xls' )
df = excel.parse( excel.sheet_names[1] )
dta = sm.datasets.co2.load_pandas().data
dta.co2.interpolate(inplace=True)
res = sm.tsa.seasonal_decompose(dta.co2)
resplot = res.plot()
res.resid
But if I try printing the dta variable, it shows some other data and not PET_PRI_SPT_S1_D.xls. Even resplot = res.plot() doesn't seem to work; no plot shows up.
Can you please guide me on how to load data from a local drive into a pandas dataframe?
Edit 1:
I get following when I tried df.info(). Here df is object of my excel file.
DatetimeIndex: 7509 entries, 1986-01-24 00:00:00 to 2015-06-08 00:00:00
Data columns (total 2 columns):
WTI      7408 non-null object
Brent    7115 non-null object
dtypes: object(2)
memory usage: 117.3+ KB
When I try dta.info() where dta is object of sm.datasets.co2 type, I get following.
DatetimeIndex: 2284 entries, 1958-03-29 00:00:00 to 2001-12-29 00:00:00
Freq: W-SAT
Data columns (total 1 columns):
co2    2284 non-null float64
dtypes: float64(1)
memory usage: 35.7 KB
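A note on why dta prints something else: the line dta = sm.datasets.co2.load_pandas().data loads statsmodels' built-in CO2 dataset, so df (the Excel data) is never used again. To decompose your own series, pass your column to seasonal_decompose after coercing it to numeric (df.info() shows WTI stored as object) and filling gaps. A sketch with synthetic data standing in for the spreadsheet:

```python
import numpy as np
import pandas as pd

# synthetic daily series standing in for the WTI column of the .xls file
idx = pd.date_range("1986-01-24", periods=730, freq="D")
df = pd.DataFrame({"WTI": np.sin(np.arange(730) * 2 * np.pi / 365) + 20}, index=idx)

# coerce the object column to float and interpolate missing points
wti = pd.to_numeric(df["WTI"], errors="coerce").interpolate()

# import statsmodels.api as sm
# res = sm.tsa.seasonal_decompose(wti, period=365)
# res.plot()  # in a plain script, follow with plt.show() to display the figure
```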

missing values using pandas.rolling_mean

I get lots of missing values when calculating rolling_mean with:
import datetime as dt
import pandas as pd
import pandas.io.data as web
stocklist = ['MSFT', 'BELG.BR']
# read historical prices for last 11 years
def get_px(stock, start):
    return web.get_data_yahoo(stock, start)['Adj Close']
today = dt.date.today()
start = str(dt.date(today.year-11, today.month, today.day))
px = pd.DataFrame({n: get_px(n, start) for n in stocklist})
px.ffill()
sma200 = pd.rolling_mean(px, 200)
got following result:
In [14]: px
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 2270 non-null values
MSFT 2769 non-null values
dtypes: float64(2)
In [15]: sma200
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2836 entries, 2002-01-14 00:00:00 to 2013-01-11 00:00:00
Data columns:
BELG.BR 689 non-null values
MSFT 400 non-null values
dtypes: float64(2)
Any idea why most of the sma200 rolling_mean values are missing and how to get the complete list ?
px.ffill() returns a new DataFrame. To modify px itself, use inplace=True.
px.ffill(inplace=True)
sma200 = pd.rolling_mean(px, 200)
print(sma200)
yields
Data columns:
BELG.BR 2085 non-null values
MSFT 2635 non-null values
dtypes: float64(2)
If you print sma200, you will probably find lots of null or missing values. This is because the threshold for number of non-nulls is high by default for rolling_mean.
Try using
sma200 = pd.rolling_mean(px, 200, min_periods=2)
From the pandas docs:
min_periods: threshold of non-null data points to require (otherwise result is NA)
You could also try changing the size of the window if your dataset is missing many points.
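Note that pd.rolling_mean was removed in pandas 0.18; in current pandas the same computation is a method on the DataFrame. A sketch with hypothetical prices:

```python
import numpy as np
import pandas as pd

# hypothetical prices with a gap, standing in for the Yahoo data
px = pd.DataFrame(
    {"MSFT": [30.0, np.nan, 31.0, 32.0, 33.0]},
    index=pd.date_range("2013-01-07", periods=5, freq="D"),
)

px = px.ffill()  # fill the gap first
sma = px.rolling(window=3, min_periods=2).mean()  # modern pd.rolling_mean
```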
