Python datetime - Get an interval of dates in a dataframe

I have a dataset in the following format:
date_time,open,close,...
2012-02-01,1307.25,1320.5,...
2012-02-03,1322.5,1339.5,...
....
These data are in a file called Dataset.csv. I read it as follows:
# This is the whole data; I will only use it later
self.data = pd.read_csv('./dataset/Dataset.csv')
# For now I only want the indices from the date_time column
self.sp = pd.read_csv('./dataset/Dataset.csv')
# Set the index
self.sp = self.sp.set_index('date_time')
# Save the indices
self.sp = self.sp.index
If I print self.sp, here is what I get:
Index(['2012-02-01', '2012-02-02', '2012-02-03', '2012-02-06', '2012-02-07',
'2012-02-08', '2012-02-09', '2012-02-10', '2012-02-13', '2012-02-14',
...
'2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22', '2019-08-23',
'2019-08-26', '2019-08-27', '2019-08-28', '2019-08-29', '2019-08-30'],
dtype='object', name='date_time', length=1960)
I would like to get an interval of values from the date_time column, given a beginning date and a number of days, as follows:
# The initial date
begin = datetime.datetime(2012, 2, 1, 0, 0, 0, 0)
# The total number of days I will take from the dataset is 360,
# starting from the date in the begin variable
trainSize = datetime.timedelta(days=360 * 1).days
# trainMinLimit will be loaded with the position of the initial date.
# If the initial date cannot be used, add 1 day and try again.
trainMinLimit = None
while trainMinLimit is None:
    try:
        trainMinLimit = self.sp.get_loc(begin)
    except:
        begin += datetime.timedelta(days=1)
# trainMaxLimit will be loaded with the position of the initial date plus the
# training size. If that date cannot be used, add 1 day to the initial date
# and try again.
trainMaxLimit = None
while trainMaxLimit is None:
    try:
        trainMaxLimit = self.sp.get_loc(begin + trainSize)
    except:
        begin += datetime.timedelta(days=1)
When I run this code, I get the following error:
trainMinLimit = self.sp.get_loc(begin)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 127, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 153, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas/_libs/index.pyx", line 170, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
KeyError: datetime.date(2012, 2, 1)
Here get_loc() does not understand how the begin variable (a datetime) can be used to index the Index of date strings. How can I use a datetime variable to get the position it occupies in a pandas Index of dates?
Edit: As suggested, I tried to convert the index to datetime format as follows:
self.sp = pd.read_csv('./dataset/Dataset.csv')
self.sp = self.sp.set_index('date_time')
self.sp = pd.to_datetime(self.sp.index)
And I get the following error:
self.sp = pd.to_datetime(self.sp.index)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/tools/datetimes.py", line 603, in to_datetime
result = convert_listlike(arg, box, format)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/tools/datetimes.py", line 302, in _convert_listlike_datetimes
allow_object=True)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/arrays/datetimes.py", line 1866, in objects_to_datetime64ns
raise e
File "/usr/local/lib/python3.5/dist-packages/pandas/core/arrays/datetimes.py", line 1857, in objects_to_datetime64ns
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 460, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 685, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 809, in pandas._libs.tslib.array_to_datetime_object
File "pandas/_libs/tslib.pyx", line 803, in pandas._libs.tslib.array_to_datetime_object
File "pandas/_libs/tslibs/parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "/usr/local/lib/python3.5/dist-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/dateutil/parser/_parser.py", line 649, in parse
raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: date_time
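For reference, here is a minimal sketch of the same lookup performed against a proper DatetimeIndex (assuming the CSV layout and path shown above). Parsing date_time while reading the file lets get_loc() accept datetime keys directly:

import datetime
import pandas as pd

# Parse date_time while reading, and keep only the resulting DatetimeIndex
sp = pd.read_csv('./dataset/Dataset.csv',
                 parse_dates=['date_time'], index_col='date_time').index

begin = datetime.datetime(2012, 2, 1)
trainSize = datetime.timedelta(days=360)

# Same retry logic as in the question, but against a DatetimeIndex
trainMinLimit = None
while trainMinLimit is None:
    try:
        trainMinLimit = sp.get_loc(begin)
    except KeyError:
        begin += datetime.timedelta(days=1)

trainMaxLimit = None
while trainMaxLimit is None:
    try:
        trainMaxLimit = sp.get_loc(begin + trainSize)
    except KeyError:
        begin += datetime.timedelta(days=1)

print(trainMinLimit, trainMaxLimit)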

Related

Getting Error While Mapping Data using Dictionary

I'm reading multiple files using the code block below. Sometimes the column names in the file and in the col_map dictionary differ, and the code throws an error: if the column name in the file is Label/Name but the col_map value is Label/, I get the error shown below. I'm looking for a wildcard kind of approach, where a partial match on the name is enough to map the value.
If the column name contains Label/, it should map that column's values (see the sketch after the code below).
Errors:
File "backup.py", line 27, in
mapping_function(df)
File "backup.py", line 24, in mapping_function
_data[i] = data[col_map[i]]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Label/Studio/Network/Developer/Publisher'
import pandas as pd

df = pd.read_csv('test.txt', sep=' ')
print(df.columns)
### Column names in the file #########
# Label/Name, Item Title, Quantity
col_map = {
    "start_date": None,
    "end_date": None,
    "product_label": "Label/",
    "product_title": "Item Title",
    "product_sku": None,
    "quantity": "Quantity"
}
def mapping_function(data):
    _data = {}
    for i in col_map:
        if col_map[i] is not None:
            _data[i] = data[col_map[i]]
mapping_function(df)
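A minimal sketch of one way to do the partial match (a hypothetical resolve_column helper, reusing the col_map and df from the question): treat each col_map value as a prefix and map the first column that starts with it:

def resolve_column(columns, pattern):
    # Return the first column name that starts with `pattern`, or None
    for col in columns:
        if col.startswith(pattern):
            return col
    return None

def mapping_function(data):
    _data = {}
    for key, pattern in col_map.items():
        if pattern is None:
            continue
        col = resolve_column(data.columns, pattern)
        if col is not None:  # skip patterns that match no column instead of raising
            _data[key] = data[col]
    return _data

mapped = mapping_function(df)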

builtin keyerror while using pandas datareader to extract data

I'm using a loop to extract data with pandas-datareader; the first two iterations work properly.
But from the third iteration on, the code returns a builtin KeyError, which is unexpected. Since the first two iterations work, why does it start failing from the third one, and how can I fix it?
import pandas as pd
import datetime as dt
import pandas_datareader as web
#====================================================
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
#############
prev = 15
endDate = dt.datetime.today().date()
sDate = endDate - pd.to_timedelta(prev, unit='d')
#############
def get_price(tickers):  # input is a list or Series
    result = pd.DataFrame()
    for i in tickers:
        df = pd.DataFrame()
        df['Adj Close'] = web.DataReader(i, 'yahoo', sDate, endDate)['Adj Close']
        df['MA'] = df['Adj Close'].rolling(5).mean()
        df.sort_values(ascending=False, inplace=True, by="Date")
        df['Higher?'] = df['Adj Close'] > df['MA']
        df['Higher?'] = df['Higher?'].astype(int)
        result['{}'.format(i)] = df['Higher?']
    return result
#=============================================================================
base_url = "http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportExcel?symbol="
data = {
    'Ticker': ['XLC','XLY','XLP','XLE','XLF','XLV','XLI','XLB','XLRE','XLK','XLU'],
    'Name': ['Communication Services','Consumer Discretionary','Consumer Staples','Energy','Financials','Health Care','Industrials','Materials','Real Estate','Technology','Utilities']
}
spdr_df = pd.DataFrame(data)
print(spdr_df)
for i, row in spdr_df.iterrows():
    url = base_url + row['Ticker']
    df_url = pd.read_excel(url)
    header = df_url.iloc[0]
    holdings_df = df_url[1:]
    holdings_df.set_axis(header, axis='columns', inplace=True)
    holdings_df = holdings_df['Symbol'].replace('.', '-')
    a = get_price(holdings_df)
    print(a)
The errors are listed below:
a=get_price(holdings_df)
File "C:/Users/austi/Desktop/stock&trading/get etf holdings Main Version.py", line 25, in <module>
df['Adj Close']=web.DataReader(i,'yahoo',sDate,endDate)['Adj Close']
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\util\_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas_datareader\data.py", line 387, in DataReader
session=session,
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas_datareader\base.py", line 251, in read
df = self._read_one_data(self.url, params=self._get_params(self.symbols))
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas_datareader\yahoo\daily.py", line 165, in _read_one_data
prices["Date"] = to_datetime(to_datetime(prices["Date"], unit="s").dt.date)
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\_libs\index.cp36-win32.pyd", line 111, in pandas._libs.index.IndexEngine.get_loc
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\_libs\index.cp36-win32.pyd", line 138, in pandas._libs.index.IndexEngine.get_loc
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\_libs\hashtable.cp36-win32.pyd", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "C:\Users\austi\Downloads\Python 3.6.3\Lib\site-packages\pandas\_libs\hashtable.cp36-win32.pyd", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
builtins.KeyError: 'Date'
For some tickers, there is no Date column.
To catch the error and continue, try this code:
def get_price(tickers):  # input is a list or Series
    result = pd.DataFrame()
    for i in tickers:
        try:
            df = pd.DataFrame()
            df['Adj Close'] = web.DataReader(i, 'yahoo', sDate, endDate)['Adj Close']
            df['MA'] = df['Adj Close'].rolling(5).mean()
            df.sort_values(ascending=False, inplace=True, by="Date")  # sometimes errors
            df['Higher?'] = df['Adj Close'] > df['MA']
            df['Higher?'] = df['Higher?'].astype(int)
            result['{}'.format(i)] = df['Higher?']
        except Exception as ex:  # no Date column
            print('Ticker', i, 'ERROR', ex)
            print(df)
    return result

Calculating the duration of days between two formatted dates in Python results in "OverflowError: int too big to convert"

I have a DataFrame of 320000 rows and 18 columns.
Two of the columns are the project start date and project end date.
I simply want to add a column with the duration of the project in days.
df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']
The data is imported from a SQL Server.
The dates are formatted (yyyy-mm-dd).
When I run the code above, I get this error:
Traceback (most recent call last):
File "pandas\_libs\tslibs\timedeltas.pyx", line 234, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
TypeError: Expected unicode, got datetime.timedelta
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 1, in
df['proj_duration'] = df['END_FORMATED'] - df['START_FORMATED']
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\common.py", line 64, in new_method
return method(self, other)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 502, in wrapper
return _construct_result(left, result, index=left.index, name=res_name)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py", line 475, in _construct_result
out = left._constructor(result, index=index)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\series.py", line 305, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 424, in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\construction.py", line 537, in _try_cast
subarr = maybe_cast_to_datetime(arr, dtype)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1346, in maybe_cast_to_datetime
value = maybe_infer_to_datetimelike(value)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1198, in maybe_infer_to_datetimelike
value = try_timedelta(v)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1187, in try_timedelta
return to_timedelta(v)._ndarray_values.reshape(shape)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 102, in to_timedelta
return _convert_listlike(arg, unit=unit, errors=errors)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\tools\timedeltas.py", line 140, in _convert_listlike
value = sequence_to_td64ns(arg, unit=unit, errors=errors, copy=False)[0]
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 943, in sequence_to_td64ns
data = objects_to_td64ns(data, unit=unit, errors=errors)
File "C:\Users\77797\Anaconda3\lib\site-packages\pandas\core\arrays\timedeltas.py", line 1052, in objects_to_td64ns
result = array_to_timedelta64(values, unit=unit, errors=errors)
File "pandas\_libs\tslibs\timedeltas.pyx", line 239, in pandas._libs.tslibs.timedeltas.array_to_timedelta64
File "pandas\_libs\tslibs\timedeltas.pyx", line 198, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
File "pandas\_libs\tslibs\timedeltas.pyx", line 143, in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds
OverflowError: int too big to convert
I suspect that there is a problem in the formatting of the dates. I tried:
a = df.head(50000)['END_FORMATED']
b = df.head(50000)['START_FORMATED']
c = a-b
and got the same error. However, when I ran it for the last 50000 rows, it worked perfectly:
x = df.tail(50000)['END_FORMATED']
y = df.tail(50000)['START_FORMATED']
z = x-y
This shows that the problem is not in the whole dataset, only in some of the rows.
Any idea how I can solve the problem?
Thanks!
It seems you have a date in your SQL dataset stored as 1009-01-06. pandas only understands dates between 1677-09-21 and 2262-04-11, as per the official documentation.
Try casting each Series to datetime to catch entries that are not in the expected range, with infer_datetime_format=True and errors='coerce', as follows:
df = pd.DataFrame()
df['START_FORMATED'] = ['2020-05-05', '2020-05-06', '2020-05-07', '1009-01-06']
df['END_FORMATED'] = ['2020-06-05', '2020-06-06', '2020-06-07', '2020-06-08']
df['proj_duration'] = (pd.to_datetime(df['END_FORMATED'], infer_datetime_format=True, errors='coerce')
                       - pd.to_datetime(df['START_FORMATED'], infer_datetime_format=True, errors='coerce'))
This sets a NaT value wherever pd.to_datetime() cannot convert the entry, which results in this df:
START_FORMATED END_FORMATED proj_duration
0 2020-05-05 2020-06-05 31 days
1 2020-05-06 2020-06-06 31 days
2 2020-05-07 2020-06-07 31 days
3 1009-01-06 2020-06-08 NaT
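To locate the offending rows in the real data, one option (a minimal sketch, assuming the df and column names from the question) is to coerce both columns and inspect the rows that come back as NaT:

import pandas as pd

# Unparseable or out-of-range dates become NaT with errors='coerce'
start = pd.to_datetime(df['START_FORMATED'], errors='coerce')
end = pd.to_datetime(df['END_FORMATED'], errors='coerce')

# Rows where either date failed to convert are the ones breaking the subtraction
bad_rows = df[start.isna() | end.isna()]
print(bad_rows)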

"KeyError 0.0" when using another dataframe containing standard deviation as error bars for the y axis

I have some code that I am unable to figure out the solution to (despite much googling and searching).
Basically, I have a script that compiles two files derived from the same data: one file is a table of the average at each wavelength for six groups, and the other is exactly the same but contains the standard deviation (both tables are the same size, same headers, etc.). Plotting the averages (df2) produces graphs, no problem.
However, when I try to use the dataframe containing the standard deviation values (df3) as the error bars for the graph, I get "KeyError: 0.0".
Is anyone able to help me figure out what's going wrong? Much appreciated and many thanks.
My script is below.
wave = pd.read_csv("Wavelength_to_Index_Conversion.csv")  # list of wavelengths used later on
##
# GET AVERAGE VALUES OF DATA FROM EACH FILE
##
i = 0
for filename in os.listdir(indir):
    print(i)
    df1 = pd.read_csv(indir + filename)  # imports file
    print(filename)
    filename = filename[:-12]  # removes extra filename and extension
    print(filename)
    if i == 0:
        df2 = pd.DataFrame(data=None, columns=None, index=df1.index)
    e = df1['Average']
    df2 = df2.assign(temp=e.values)
    print(filename)
    df2.rename(columns={'temp': str(filename)}, inplace=True)
    i = i + 1
df2.set_index(wave['Wavelength'], inplace=True)
df2.to_csv("Total_Average.csv")
df2t = df2.T
df2t.to_csv("Total_Average_T.csv")
##
## GET STD DEV
##
i = 0
for filename in os.listdir(indir):
    print(i)
    df1 = pd.read_csv(indir + filename)  # imports file
    print(filename)
    filename = filename[:-12]
    print(filename)
    if i == 0:
        df3 = pd.DataFrame(data=None, columns=None, index=df1.index)
    e = df1['Std Dev']
    df3 = df3.assign(temp=e.values)
    print(filename)
    df3.rename(columns={'temp': str(filename)}, inplace=True)
    i = i + 1
df3.set_index(wave['Wavelength'], inplace=True)
df3.to_csv("Total_Std Dev.csv")
#plt.rc('font', family='serif', size=13)
#plt.rc('font', family='san-serif', size=13)
#plt.ylim(0, 60)
#plt.xlim(400, 900)
plt.title('Title', fontsize=20)
plt.xlabel('Wavelength (nm)', fontsize=15)
plt.ylabel('Reflectance (%)', fontsize=15)
#plt.autoscale(enable=True, axis=u'both', tight=False)
plt.grid(False)
df2.plot(x=wave['Wavelength'], xlim=(400, 850), ylim=(0, 60), yerr=df3)  # ERROR HERE WHENEVER yerr=df3 IS USED
EDIT:
Here's the error I'm given.
Traceback (most recent call last):
File "<ipython-input-8-d121c6b80d76>", line 1, in <module>
runfile('C:/Users/user/Documents/PhD/Data Manipulation and Graphs/Spectrometry/Plotting All Files into a Graph [X]/Plotting Spectral.py', wdir='C:/Users/user/Documents/PhD/Data Manipulation and Graphs/Spectrometry/Plotting All Files into a Graph [X]')
File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/jdstam/Documents/PhD/Data Manipulation and Graphs/Spectrometry/Plotting All Files into a Graph [X]/Plotting Spectral.py", line 115, in <module>
df2.plot(x = wave['Wavelength'], xlim=(400,850), ylim=(0,60), yerr=df3)
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 3671, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 2556, in plot_frame
**kwds)
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 2384, in _plot
plot_obj.generate()
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 987, in generate
self._make_plot()
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 1664, in _make_plot
**kwds)
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 1678, in _plot
lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)
File "C:\Anaconda3\lib\site-packages\pandas\tools\plotting.py", line 1293, in _plot
return ax.errorbar(x, y, **kwds)
File "C:\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
return func(ax, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 2963, in errorbar
iterable(yerr[0]) and len(yerr[0]) > 1))):
File "C:\Anaconda3\lib\site-packages\pandas\core\series.py", line 557, in __getitem__
result = self.index.get_value(self, key)
File "C:\Anaconda3\lib\site-packages\pandas\core\index.py", line 3884, in get_value
loc = self.get_loc(k)
File "C:\Anaconda3\lib\site-packages\pandas\core\index.py", line 3942, in get_loc
tolerance=tolerance)
File "C:\Anaconda3\lib\site-packages\pandas\core\index.py", line 1759, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:3979)
File "pandas\index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)
File "pandas\hashtable.pyx", line 498, in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9556)
File "pandas\hashtable.pyx", line 504, in pandas.hashtable.Float64HashTable.get_item (pandas\hashtable.c:9494)
KeyError: 0.0
EDIT 2:
i. Output for df2.head()
Mon18th_C24_20% Mon18th_C24_50% Mon18th_C24_80% Mon18th_Col0_20% Mon18th_Col0_50% Mon18th_Col0_80%
Wavelength
341.514 -1.963704 32.036429 13.413667 -1.396250 39.647 -23.480313
341.892 -1.963704 32.036429 13.413667 -1.396250 39.647 -23.480313
342.269 11.653704 -44.896786 180.915667 -6.503438 38.840 50.237812
342.646 18.801111 305.569286 532.147000 -27.857500 504.351 66.496562
343.024 28.421852 48.280000 65.551000 968.713437 79.782 63.419688
ii. Output for df3.head()
Mon18th_C24_20% Mon18th_C24_50% Mon18th_C24_80% Mon18th_Col0_20% Mon18th_Col0_50% Mon18th_Col0_80%
Wavelength
341.514 103.027866 121.768024 98.039234 60.476486 101.865202 325.329848
341.892 103.027866 121.768024 98.039234 60.476486 101.865202 325.329848
342.269 59.441182 719.315230 643.463142 52.176211 606.273169 177.225557
342.646 41.717983 820.680259 922.506270 73.360823 689.563817 93.132120
343.024 47.129664 75.400200 73.011963 1728.405146 71.020783 52.873068
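For reference, a minimal sketch that draws the error bars with matplotlib's errorbar directly (assuming df2 and df3 share the same Wavelength index and column names, as the outputs above suggest):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for col in df2.columns:
    # df2 holds the averages, df3 the matching standard deviations
    ax.errorbar(df2.index, df2[col], yerr=df3[col], label=col)
ax.set_xlim(400, 850)
ax.set_ylim(0, 60)
ax.set_xlabel('Wavelength (nm)')
ax.set_ylabel('Reflectance (%)')
ax.legend()
plt.show()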

TimeSeries plots with Bokeh

NOTE FROM BOKEH MAINTAINER: The bokeh.charts API including TimeSeries was deprecated and removed a long time ago. This question is not relevant as-is to any recent or future versions of Bokeh. To plot time series, use the stable and supported bokeh.plotting API. Some examples can be found here.
I am trying to plot a time series with categories.
xaxis_values: startTime
yaxis_values: count
groupby: day
Every day has a 24-hour set of data, and the entire dataset covers more than 100 days. I am trying to produce a few types of plots:
Group by day and sum the counts of every hour from startTime, which will give 7 time series in one graph.
Separate by day, i.e. every Mon, Tue, Wed and so on for however many days n, and plot a 24-hour time series for each.
Group by hour irrespective of day, i.e. 00:00:00, 01:00:00 and so on.
What is the best way to get a good visualization of this with Bokeh or seaborn?
Input:
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
Code: Reference: Here
import numpy as np
import pandas as pd
from bokeh.charts import TimeSeries, show, output_file, vplot

output_file("timeseries.html")
data_one = pd.read_csv('one_hour.csv')
data_one.columns = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']
data = dict(data_one=data_one['count'])
tsline = TimeSeries(data,
                    x='startTime', y='count',
                    color=['day'], title="Timeseries", ylabel='count', legend=True)
show(vplot(tsline))
Error:
Traceback (most recent call last):
File "date_graph.py", line 10, in <module>
data = dict(data_one=data_one['count'])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'count'
Edit: After changing
data = dict(data_one=data_one['count'].tolist())
Error:
Traceback (most recent call last):
File "date_graph.py", line 12, in <module>
tsline = TimeSeries(data, x='startTime', y='count', color=['startTime'], title="Timeseries", ylabel='count', legend=True)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/builders/timeseries_builder.py", line 102, in TimeSeries
return create_and_build(builder_type, data, **kws)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/builder.py", line 67, in create_and_build
chart.add_builder(builder)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/chart.py", line 149, in add_builder
builder.create(self)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/builder.py", line 518, in create
chart.add_renderers(self, renderers)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/chart.py", line 144, in add_renderers
self.renderers += renderers
File "/usr/local/lib/python2.7/dist-packages/bokeh/core/property_containers.py", line 18, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/bokeh/core/property_containers.py", line 77, in __iadd__
return super(PropertyValueList, self).__iadd__(y)
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/builders/line_builder.py", line 230, in yield_renderers
x=group.get_values(self.x.selection),
File "/usr/local/lib/python2.7/dist-packages/bokeh/charts/data_source.py", line 173, in get_values
return self.data[selection]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'startTime'
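Following the maintainer's note above, here is a minimal sketch of the first variant (one line per day over startTime) using the stable bokeh.plotting API, assuming the header-less CSV layout shown in the question (legend_label requires a reasonably recent Bokeh):

import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.palettes import Category10

output_file("timeseries.html")

# The input has no header row, so name the columns explicitly
df = pd.read_csv('one_hour.csv',
                 names=['date', 'startTime', 'endTime', 'day', 'count', 'unique'])
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['startTime'])

p = figure(x_axis_type='datetime', title='Timeseries',
           x_axis_label='startTime', y_axis_label='count')
colors = iter(Category10[10])
for day, group in df.groupby('day'):
    p.line(group['timestamp'], group['count'],
           legend_label=day, color=next(colors))
show(p)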
