Scatterplot of pandas DataFrame ends in KeyError: 0 - python

After I updated pandas (0.23.4) and matplotlib (3.0.1), I get a strange error when trying to do something like the following:
import pandas as pd
import matplotlib.pyplot as plt
clrdict = {1: "#a6cee3", 2: "#1f78b4", 3: "#b2df8a", 4: "#33a02c"}
df_full = pd.DataFrame({'x':[20,30,30,40],
'y':[25,20,30,25],
's':[100,200,300,400],
'l':[1,2,3,4]})
df_full['c'] = df_full['l'].replace(clrdict)
df_part = df_full[(df_full.x == 30)]
fig = plt.figure()
plt.scatter(x=df_full['x'],
y=df_full['y'],
s=df_full['s'],
c=df_full['c'])
plt.show()
fig = plt.figure()
plt.scatter(x=df_part['x'],
y=df_part['y'],
s=df_part['s'],
c=df_part['c'])
plt.show()
The scatterplot of the original DataFrame (df_full) is shown without problems. But the plot of the partial DataFrame (df_part) raises the following error:
Traceback (most recent call last):
File "G:\data\project\test.py", line 27, in <module>
c=df_part['c'])
File "C:\Program Files\Python37\lib\site-packages\matplotlib\pyplot.py", line 2864, in scatter
is not None else {}), **kwargs)
File "C:\Program Files\Python37\lib\site-packages\matplotlib\__init__.py", line 1805, in inner
return func(ax, *args, **kwargs)
File "C:\Program Files\Python37\lib\site-packages\matplotlib\axes\_axes.py", line 4195, in scatter
isinstance(c[0], str))):
File "C:\Program Files\Python37\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
result = self.index.get_value(self, key)
File "C:\Program Files\Python37\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 964, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
This is due to the color option c=df_part['c']. When you leave it out, the problem doesn't occur. This didn't happen before the updates, so you may not be able to reproduce it with lower versions of matplotlib or pandas (I have no idea which one causes it).
In my project the df_part = df_full[(df_full.x == i)] line is used within the update-function of a matplotlib.animation.FuncAnimation. The result is an animation over the values of x (which are timestamps in my project). So I need a way to part the DataFrame.

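This is a bug that was fixed by https://github.com/matplotlib/matplotlib/pull/12673. As the traceback shows, scatter evaluates c[0]; on a pandas Series that is a label lookup, and the filtered df_part no longer has a row labelled 0, hence KeyError: 0.
The fix should be available in the next bugfix release, 3.0.2, which should be out within the next few days.
In the meantime, you can pass the NumPy array underlying the pandas Series (series.values) instead of the Series itself.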
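A minimal sketch of that workaround, reusing the data from the question. The Agg backend and the use of .map instead of .replace are my own choices so the snippet runs headless and without warnings; they are not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

clrdict = {1: "#a6cee3", 2: "#1f78b4", 3: "#b2df8a", 4: "#33a02c"}
df_full = pd.DataFrame({'x': [20, 30, 30, 40],
                        'y': [25, 20, 30, 25],
                        's': [100, 200, 300, 400],
                        'l': [1, 2, 3, 4]})
df_full['c'] = df_full['l'].map(clrdict)  # .map used here in place of .replace
df_part = df_full[df_full.x == 30]

# The filtered frame keeps the original labels 1 and 2, so there is no label 0.
print(list(df_part.index))

fig, ax = plt.subplots()
# Passing .values hands matplotlib plain NumPy arrays, which are indexed by position.
points = ax.scatter(x=df_part['x'].values,
                    y=df_part['y'].values,
                    s=df_part['s'].values,
                    c=df_part['c'].values)
fig.savefig("scatter_part.png")
```

Calling df_part.reset_index(drop=True) before plotting should have the same effect, since it restores a label 0.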

How to add custom line to BoxWhisker holoviews plot?

I need to add a horizontal line to the boxplot. I looked through the HoloViews manual, and it seems that HLine is supposed to be used in such a case. Unfortunately, I get an error:
ValueError: all the input arrays must have same number of dimensions
Example:
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
groups = [chr(65+g) for g in np.random.randint(0, 3, 200)]
boxwhisker = hv.BoxWhisker(
(groups, np.random.randint(0, 5, 200), np.random.randn(200)),
['Group', 'Category'],
'Value'
).sort() * hv.HLine(1)
boxwhisker.opts(
opts.BoxWhisker(
box_color='white',
height=400,
show_legend=False,
whisker_color='gray',
width=600
),
opts.HLine(color='green', line_width=2)
)
layout = hv.Layout(boxwhisker)
hv.save(layout, 'boxplot.html')
Traceback:
File "/home/python3.6/site-packages/holoviews/plotting/renderer.py", line 545, in save
plot = self_or_cls.get_plot(obj)
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py", line 135, in get_plot
plot = super(BokehRenderer, self_or_cls).get_plot(obj, renderer, **kwargs)
File "/home/python3.6/site-packages/holoviews/plotting/renderer.py", line 207, in get_plot
plot.update(init_key)
File "/home/python3.6/site-packages/holoviews/plotting/plot.py", line 595, in update
return self.initialize_plot()
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/plot.py", line 995, in initialize_plot
subplots = subplot.initialize_plot(ranges=ranges, plots=shared_plots)
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/plot.py", line 1115, in initialize_plot
adjoined_plots.append(subplot.initialize_plot(ranges=ranges, plots=passed_plots))
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/element.py", line 2058, in initialize_plot
self._update_ranges(element, ranges)
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/element.py", line 747, in _update_ranges
xfactors, yfactors = self._get_factors(element, ranges)
File "/home/python3.6/site-packages/holoviews/plotting/bokeh/element.py", line 2031, in _get_factors
xfactors = np.concatenate(xfactors)
ValueError: all the input arrays must have same number of dimensions
Yes, HLine is the right way to do this, but unfortunately the support for categorical axes in HoloViews is currently limited and will not allow such an overlay. There's been some work on an unfinished alternative implementation for categorical axes that would fix this. In the meantime, I'd assume you could add a custom hook, but that would be awkward to figure out.
Note that hv.Layout(boxwhisker) won't work as you have it above, in any case; it would need to be hv.Layout([boxwhisker]) or just boxwhisker (as Layout takes a list, but here you don't even need a layout since you have only one item).

Having Issues with an AssertionError when trying to use the psd() command in matplotlib

I'm trying to write a short script that takes a .csv file with some distance data and outputs the PSD for it. The code is here:
import math
import matplotlib.pyplot as plt
name = raw_input('File:')
data = open(name + '.csv', 'r')
distances = []
for row in data:
distances.append(row.replace("\n",""))
for i in range(len(distances)):
distances[i] = float(distances[i])
Pxx, freqs = plt.psd(distances, NFFT=16,Fs=2,detrend='detrend_mean',window='window_none',noverlap=128,sides='onesided',scale_by_freq=True)
plot(Pxx,freqs)
plt.savefig(name + 'psd.png', bbox_inches = 'tight')
As you can see, it's pretty simple. The csv file just features one column of numbers, so distances is a vector.
The error I'm getting is as follows:
Traceback (most recent call last):
File "C:psdplot.py", line 15, in <module>
Pxx, freqs = plt.psd(distances, NFFT=16,Fs=2,detrend='detrend_mean',window='window_none',noverlap=128,sides='onesided',scale_by_freq=True)
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 3029, in psd
sides=sides, scale_by_freq=scale_by_freq, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 8696, in psd
sides, scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 389, in psd
scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 423, in csd
noverlap, pad_to, sides, scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 251, in _spectral_helper
assert(len(window) == NFFT)
AssertionError
Could someone direct me on how to fix this? I'm sure it's rather obvious, but I haven't been able to find anything on fixing it in this particular context.
Thanks in advance!
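A possible cause, judging only from the assertion in the traceback (assert(len(window) == NFFT)): window is given as the string 'window_none', which has 11 characters, while NFFT is 16. The sketch below assumes the intent was a rectangular window and mean detrending, and passes the callables from matplotlib.mlab instead of strings. The random input data and noverlap=8 are my own stand-ins; noverlap has to be smaller than NFFT, which 128 is not.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import mlab

# Stand-in for the column of distances read from the csv file.
distances = np.random.default_rng(0).normal(size=256)

Pxx, freqs = plt.psd(distances,
                     NFFT=16,
                     Fs=2,
                     detrend=mlab.detrend_mean,  # callable, not the string 'detrend_mean'
                     window=mlab.window_none,    # callable, not the string 'window_none'
                     noverlap=8,                 # must be smaller than NFFT
                     sides='onesided',
                     scale_by_freq=True)
plt.savefig('psd.png', bbox_inches='tight')
```

plt.psd already draws the spectrum on the current axes, so the separate plot(Pxx, freqs) call in the question should not be needed (and as written it would fail anyway, since plot is not imported under that name).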

Plotting data from csv using matplotlib.pyplot

I am trying to follow a tutorial on YouTube in which they plot some standard text files using matplotlib.pyplot. I can achieve this easily enough, but I am now trying to do the same thing using some CSVs I have of real data.
The code I am using is:
import matplotlib.pyplot as plt
import csv
#import numpy as np
with open(r"Example RFI regression axis\Delta RFI.csv") as x, open(r"Example RFI regression axis\strikerate.csv") as y:
readx = csv.reader(x)
ready = csv.reader(y)
plt.plot(readx,ready)
plt.title ('Test graph')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
The traceback I receive is long
Traceback (most recent call last):
File "C:\V4 code snippets\matplotlib_test.py", line 11, in <module>
plt.plot(readx,ready)
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 2832, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 3997, in plot
self.add_line(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1507, in add_line
self._update_line_limits(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1516, in _update_line_limits
path = line.get_path()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 677, in get_path
self.recache()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 401, in recache
x = np.asarray(xconv, np.float_)
File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 320, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
Please advise what I need to do, I realise this is probably very easy to most seasoned coders. Kind regards SMNALLY
csv.reader() returns strings (technically, the .next() method of the reader object returns lists of strings). Without converting them to float or int, you won't be able to plt.plot() them.
To save the trouble of converting, I suggest using genfromtxt() from numpy. (http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html)
For example, there are two files:
data1.csv:
data1
2
3
4
3
6
6
4
and data2.csv:
data2
92
73
64
53
16
26
74
Both of them have one line of header. We can do:
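import numpy as np
import matplotlib.pyplot as plt
# skip_header=1 skips the single header line; both files are assumed to be in the current working directory
data1 = np.genfromtxt('data1.csv', skip_header=1)
data2 = np.genfromtxt('data2.csv', skip_header=1)
plt.plot(data1, data2, 'o-')
plt.show()
and the result is a plot of data2 against data1, drawn as a line with circle markers.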
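If you would rather stay with the csv module, the conversion has to be done by hand. Here is a rough sketch, assuming one-column files with a single header line like the ones above; read_column is a hypothetical helper name, and the two small files are written to a temporary directory only so that the snippet is self-contained.

```python
import csv
import os
import tempfile


def read_column(path):
    """Read a one-column CSV with one header line and return a list of floats."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        # Each row is a list of strings, so convert the first field explicitly.
        return [float(row[0]) for row in reader if row]


# Write two tiny sample files (shortened versions of data1.csv and data2.csv).
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "data1.csv")
path2 = os.path.join(tmpdir, "data2.csv")
with open(path1, "w", newline="") as f:
    f.write("data1\n2\n3\n4\n")
with open(path2, "w", newline="") as f:
    f.write("data2\n92\n73\n64\n")

xs = read_column(path1)
ys = read_column(path2)
print(xs, ys)
# xs and ys are now plain lists of floats and could be passed to plt.plot(xs, ys, 'o-').
```

Note that the reader objects themselves cannot be passed to plt.plot(); they have to be consumed into lists of numbers first.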

Calculate logistic regression in python

I tried to calculate a logistic regression. I have the data as a csv file.
It looks like this:
node_id,second_major,gender,major_index,year,dorm,high_school,student_fac
0,0,2,257,2007,111,2849,1
1,0,2,271,2005,0,51195,2
2,0,2,269,2007,0,21462,1
3,269,1,245,2008,111,2597,1
..........................
This is my code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("Reed98.csv")
print df.describe()
dummy_ranks = pd.get_dummies(df['second_major'], prefix='second_major')
cols_to_keep = ['second_major', 'dorm', 'high_school']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'year':])
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
logit = sm.Logit(data['second_major'], data[train_cols])
result = logit.fit()
print result.summary()
When I run the code in Python I get an error:
Traceback (most recent call last):
File "D:\project\logisticregression.py", line 24, in <module>
result = logit.fit()
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\discrete\discrete_model.py", line 282, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\discrete\discrete_model.py", line 233, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 291, in fit
hess=hess)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 341, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 328, in solve
raise LinAlgError('Singular matrix')
LinAlgError: Singular matrix
How to rewrite the code?
There's nothing wrong with your code. My guess is that you have missing values in your data. Try calling dropna() on the data first, or pass missing='drop' to Logit. You might also check that the right-hand side is full rank with np.linalg.matrix_rank(data[train_cols].values).
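The two checks can be sketched like this. The small DataFrame is made up for illustration (it is not the Reed98 data), and the dorm_copy column is added on purpose to show what a rank-deficient design matrix looks like.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value and one column that is an exact multiple of another.
df = pd.DataFrame({
    "dorm": [1.0, 0.0, 0.0, 1.0, np.nan, 0.0],
    "high_school": [2.0, 5.0, 3.0, 7.0, 4.0, 6.0],
})
df["const"] = 1.0
df["dorm_copy"] = 2 * df["dorm"]  # deliberately collinear with "dorm"

# Check 1: missing values. dropna() removes the incomplete row.
n_missing = int(df.isna().any(axis=1).sum())
clean = df.dropna()

# Check 2: rank of the right-hand side. Full rank means rank == number of columns.
train_cols = ["const", "dorm", "high_school", "dorm_copy"]
X = clean[train_cols].values
rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]
print(n_missing, rank, n_cols)

# With statsmodels, the missing rows could instead be dropped at model creation:
# sm.Logit(y, X, missing='drop')
```

If the rank is smaller than the number of columns, some column is a linear combination of the others, and inverting the Hessian fails with "Singular matrix". Dropping one of the redundant columns is the usual way out.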

date2num , ValueError: ordinal must be >= 1

I'm using the matplotlib candlestick module, which requires the time to be passed in float days format. I'm using date2num to convert it beforehand.
This is my code:
import csv
import sys
import math
import numpy as np
import datetime
from optparse import OptionParser
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import matplotlib.mlab as mlab
import matplotlib.dates as mdates
from matplotlib.finance import candlestick
from matplotlib.dates import date2num
datafile = 'historical_data/AUD_Q10_1D_500.csv'
print 'loading', datafile
r = mlab.csv2rec(datafile, delimiter=';')
quotes = [date2num(r['date']),r['open'],r['close'],r['max'],r['min']]
candlestick(ax, quotes, width=0.6)
plt.show()
(Here is the csv file: http://db.tt/MIOqFA0)
This is what the doc says:
candlestick(ax, quotes, width=0.20000000000000001, colorup='k', colordown='r', alpha=1.0)
quotes is a list of (time, open, close, high, low, ...) tuples. As long as the first 5 elements of the tuples are these values, the tuple can be as long as you want (eg it may store volume).
time must be in float days format - see date2num
Here is the full error log :
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/matplotlib/backends/backend_qt4agg.py", line 83, in paintEvent
FigureCanvasAgg.draw(self)
File "/usr/lib/python2.6/site-packages/matplotlib/backends/backend_agg.py", line 394, in draw
self.figure.draw(self.renderer)
File "/usr/lib/python2.6/site-packages/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/matplotlib/figure.py", line 798, in draw
func(*args)
File "/usr/lib/python2.6/site-packages/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/matplotlib/axes.py", line 1946, in draw
a.draw(renderer)
File "/usr/lib/python2.6/site-packages/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/matplotlib/axis.py", line 971, in draw
tick_tups = [ t for t in self.iter_ticks()]
File "/usr/lib/python2.6/site-packages/matplotlib/axis.py", line 904, in iter_ticks
majorLocs = self.major.locator()
File "/usr/lib/python2.6/site-packages/matplotlib/dates.py", line 743, in __call__
self.refresh()
File "/usr/lib/python2.6/site-packages/matplotlib/dates.py", line 752, in refresh
dmin, dmax = self.viewlim_to_dt()
File "/usr/lib/python2.6/site-packages/matplotlib/dates.py", line 524, in viewlim_to_dt
return num2date(vmin, self.tz), num2date(vmax, self.tz)
File "/usr/lib/python2.6/site-packages/matplotlib/dates.py", line 289, in num2date
if not cbook.iterable(x): return _from_ordinalf(x, tz)
File "/usr/lib/python2.6/site-packages/matplotlib/dates.py", line 203, in _from_ordinalf
dt = datetime.datetime.fromordinal(ix)
ValueError: ordinal must be >= 1
If I run a quick :
for x in r['date']:
print str(x) + "is :" + str(date2num(x))
it outputs something like :
2010-06-12is :733935.0
2010-07-12is :733965.0
2010-08-12is :733996.0
which sounds OK to me :)
Read the docstring a bit more carefully :)
quotes is a list of (time, open, close, high, low, ...) tuples.
What's happening is that it expects each item of quotes to be a sequence of (time, open, close, high, low).
You're passing in 5 long arrays, it expects a long sequence of 5 items.
You just need to zip your input.
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib.finance import candlestick
from matplotlib.dates import date2num
datafile = 'Downloads/AUD_Q10_1D_500.csv'
r = mlab.csv2rec(datafile, delimiter=';')
quotes = zip(date2num(r['date']),r['open'],r['close'],r['max'],r['min'])
fig, ax = plt.subplots()
candlestick(ax, quotes, width=0.6)
plt.show()
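To make the shape difference concrete, here is a small sketch with plain lists. The three float-day values are the ones printed in the question; the prices are made up. On Python 3, zip returns an iterator, so it is wrapped in list() here.

```python
# Float-day values taken from the question's output; the prices are invented.
dates = [733935.0, 733965.0, 733996.0]
opens = [0.81, 0.83, 0.85]
closes = [0.82, 0.84, 0.86]
highs = [0.83, 0.85, 0.87]
lows = [0.80, 0.82, 0.84]

# What the question passes: 5 long sequences (one per field).
columns = [dates, opens, closes, highs, lows]

# What candlestick expects: one (time, open, close, high, low) tuple per day.
quotes = list(zip(dates, opens, closes, highs, lows))

print(len(columns), len(columns[0]))  # 5 sequences of 3 values each
print(len(quotes), len(quotes[0]))    # 3 tuples of 5 values each
print(quotes[0])
```

With the unzipped version, the first "quote" is the whole dates list, so its second element (another date) gets treated as the open price, and so on.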
Seems like you're passing it a float. And in the error message you provide (full message next time please!) it appears that matplotlib is simply delegating the conversion to datetime.datetime.fromordinal.
I don't have a Python 3 installation to test this with, but when I tried to convert a float to a datetime object using datetime.datetime.fromordinal in 2.6, I got a deprecation warning. Then I tried it on ideone and got this:
Traceback (most recent call last):
File "prog.py", line 2, in <module>
print(datetime.datetime.fromordinal(5.5))
TypeError: integer argument expected, got float
So perhaps it's choking on the float.
I think your problem is here:
r = mlab.csv2rec(datafile, delimiter=';')
You need to skip the first line of the csv, which means you need:
r = mlab.csv2rec(datafile, delimiter=';', skiprows=1)
Technically this is incorrect: Ubuntu has an older version of the library, and the OP's version already has the two lines below. But it was my original answer:
I would make sure you're using the most recent version of matplotlib.
So that I could reproduce this issue, I downloaded and installed the latest version and I noticed that the line number of the offending piece of code had been changed to 179. I also noticed that the value is cast to int immediately before fromordinal is called (this gives a lot of credence to senderle's answer).
(line 178-179 of most recent matplotlib in Ubuntu repository)
ix = int(x)
dt = datetime.datetime.fromordinal(ix)
If upgrading is not an option, then you should cast to an int first.
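For reference, a quick sketch of how datetime.datetime.fromordinal behaves with the kinds of values discussed above, run on Python 3. try_fromordinal is just a hypothetical helper that reports the exception type instead of raising.

```python
import datetime

# A float-day value from the question's output, cast to int before the call.
float_day = 733935.0
as_date = datetime.datetime.fromordinal(int(float_day))
print(as_date)


def try_fromordinal(value):
    """Return the datetime, or the name of the exception that fromordinal raises."""
    try:
        return datetime.datetime.fromordinal(value)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


float_result = try_fromordinal(5.5)  # a float is rejected
zero_result = try_fromordinal(0)     # ordinals below 1 are rejected
print(float_result, zero_result)
```

The error in the question is a ValueError rather than a TypeError, which fits the version that already casts to int: the value reaching fromordinal is an integer, but one smaller than 1.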
