Creating new df columns via iteration - python

I have a dataframe, df, which looks like this:
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm trying to create a new column that takes the df.Open value 5 rows ahead of each df.Open value and subtracts the current value from it.
So the loop I'm using is this:
for i in range(0, len(df.Open)):  # goes through index values
    df['5days'][i] = df.Open[i + 5] - df.Open[i]  # use those index values to locate
However, this loop is yielding an error.
KeyError: '5days'
Not sure why. I got this to temporarily work by removing the df['5days'][i], but it seems awfully slow. Not sure if there is a more efficient way to do this.
Thank you.

Using diff
df['5Days'] = df.Open.diff(5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
However, per your code, you may want to look ahead and align the results back. In that case
df['5Days'] = -df.Open.diff(-5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
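For completeness, the look-ahead variant can be checked in a minimal runnable sketch, rebuilding the frame from the question's sample data:

```python
import pandas as pd

# Reconstruct the Open column from the question's sample data.
df = pd.DataFrame(
    {"Open": [2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70]},
    index=pd.to_datetime(
        ["2007-03-22", "2007-03-23", "2007-03-26",
         "2007-03-27", "2007-03-28", "2007-03-29", "2007-03-30"]
    ),
)

# Look 5 rows ahead and subtract the current value; the last 5 rows
# have no value 5 rows ahead, so they become NaN.
df["5Days"] = df["Open"].shift(-5) - df["Open"]
print(df["5Days"].round(2).tolist())
```

This matches `-df.Open.diff(-5)` exactly; `shift` just makes the "look ahead, then subtract" reading explicit.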

I think you need shift with sub:
df['5days'] = df.Open.shift(5).sub(df.Open)
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 -1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 -0.83
Or maybe you need to subtract the shifted column from Open:
df['5days'] = df.Open.sub(df.Open.shift(5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
df['5days'] = -df.Open.sub(df.Open.shift(-5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN

Related

Scikit-learn regression on two variables given a 2D matrix of reference values

I have a matrix of reference values and would like to learn how Scikit-learn can be used to generate a regression model for it. I have done several types of univariate regressions in the past but it's not clear to me how to use two variables in sklearn.
I have two features (A and B) and a table of output values for certain input A/B values; see the table and 3D surface below. I'd like to translate this into a two-variable equation relating the A/B inputs to the single-value output, as shown in the table. The relationship looks nonlinear; it could also be quadratic, logarithmic, etc.
How do I use sklearn to perform a nonlinear regression on this tabular data?
A/B 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
There is probably a succinct nonlinear relationship between your A, B, and table values, but without some knowledge of the system or more sophisticated nonlinear modeling, here's a ridiculous model with a decent score.
the_table = """1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69"""
import numpy as np
import pandas as pd
import io
df = pd.read_csv(io.StringIO(the_table), sep=r"\s+")
df.columns = df.columns.astype(np.uint64)
df_unstacked = df.unstack()
X = df_unstacked.index.tolist()
y = df_unstacked.to_list()
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(6)
poly_X = poly.fit_transform(X)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9762180339233807
To predict values for a given A and B using this model, you need to transform the input the same way as when creating the model. For example,
model.predict(poly.transform([(1534,56)]))
# array([2.75275659])
Even more outrageously ridiculous ...
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in X]
poly = PolynomialFeatures(5)
poly_X = poly.fit_transform(more_X)
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9982994398684035
... and to predict:
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in [(1534,56)]]
model.predict(poly.transform(more_X))
# array([2.74577017])
N.B.: There are probably better ways to program these ridiculous models Pythonically.
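One tidier option (a sketch, not the answer's code) is to bundle the polynomial expansion and the regression in a scikit-learn Pipeline, so predict() applies the same fitted transform automatically. The grid below is synthetic stand-in data, not the question's table:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the (B, A) -> value grid from the question.
X = [(b, a) for b in range(1000, 2001, 100) for a in range(0, 101, 5)]
y = [0.002 * b - 0.01 * a + 1e-5 * a * b for b, a in X]

# The pipeline chains feature expansion and the linear fit, so
# predict() on raw (B, A) pairs reuses the fitted transform.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(X, y)
prediction = model.predict([(1534, 56)])
```

This avoids having to remember to call poly.transform() by hand before every predict.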

Plotting Contour plot for a dataframe with x axis as datetime and y axis as depth

I have a dataframe with the indexes as datetime and columns as depths. I would like to plot a contour plot which looks something like the image below. Any ideas how I should go about doing this? I tried using the plt.contour() function but I think I have to sort out the arrays for the data first. I am unsure about this part.
Example of my dataframe:
Datetime -1.62 -2.12 -2.62 -3.12 -3.62 -4.12 -4.62 -5.12
2019-05-24 15:45:00 4.61 5.67 4.86 3.91 3.35 3.07 3.03 2.84
2019-05-24 15:50:00 3.76 4.82 4.13 3.32 2.84 2.40 2.18 1.89
2019-05-24 15:55:00 3.07 3.77 3.23 2.82 2.41 2.21 1.93 1.81
2019-05-24 16:00:00 2.50 2.95 2.63 2.29 1.97 1.73 1.57 1.48
2019-05-24 16:05:00 2.94 3.62 3.23 2.82 2.62 2.31 2.01 1.81
2019-05-24 16:10:00 3.07 3.77 3.23 2.82 2.51 2.31 2.10 1.89
2019-05-24 16:15:00 2.71 3.20 2.86 2.70 2.51 2.31 2.18 1.97
2019-05-24 16:20:00 2.50 3.07 2.86 2.82 2.73 2.50 2.37 2.22
2019-05-24 16:25:00 2.40 3.20 3.10 2.93 2.73 2.50 2.57 2.84
2019-05-24 16:30:00 2.21 2.95 2.86 2.70 2.73 2.72 2.91 3.49
2019-05-24 16:35:00 2.04 2.72 2.63 2.59 2.62 2.72 3.03 3.35
2019-05-24 16:40:00 1.73 2.31 2.33 2.39 2.62 2.95 3.57
Example of the plot I want:
For the X Y Z input in plt.contour(), I would like to find out what structure of data it requires. It says it requires a 2D array structure, but I am confused. How do I get that with my current dataframe?
I have worked out a solution. Note that the X (tt2, time) and Y (depth) inputs have to match the Z (mat2) matrix dimensions for plt.contourf to work. I realised plt.contourf produces the image I want, rather than plt.contour, which only plots the contour lines.
Example of my code:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

tt2 = [...]    # time values
depth = [...]  # depth values
plt.title('SSC Contour Plot')
cs = plt.contourf(tt2, depth, mat2, cmap='jet',
                  levels=[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26],
                  extend="both")
plt.gca().invert_yaxis()  # flip the depth from shallowest to deepest
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y %H:%M:%S'))
cbar = plt.colorbar()
cbar.set_label("mg/l")
yy = len(colls2)
plt.ylim(yy - 15, 0)  # assuming the last depth readings have NaN
plt.xlabel("Datetime")
plt.xticks(rotation=45)
plt.ylabel("Depth (m)")
plt.savefig(path + 'SSC contour plot.png')  # save plot
plt.show()
Example of plot produced
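To make the 2D-array requirement concrete, here's a hedged sketch with a tiny frame shaped like the question's (times as index, depths as columns). The key point is that `df.values.T` yields the Z matrix with shape (len(depths), len(times)), matching contourf's (len(Y), len(X)) expectation:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

# Tiny frame shaped like the question's data: datetime index, depth columns.
idx = pd.date_range("2019-05-24 15:45", periods=4, freq="5min")
df = pd.DataFrame(
    [[4.61, 5.67, 4.86], [3.76, 4.82, 4.13],
     [3.07, 3.77, 3.23], [2.50, 2.95, 2.63]],
    index=idx, columns=[-1.62, -2.12, -2.62],
)

# contourf expects Z with shape (len(Y), len(X)), hence the transpose.
Z = df.values.T
cs = plt.contourf(df.index, df.columns, Z, cmap="jet")
print(Z.shape)
```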

Creating a heatmap using python and csv file

I'm trying to create a heatmap, with the x axis being time, the y axis being detectors (it's for freeway speed detection), and the colour scheme and numbers on the graph being for occupancy or basically what values the csv has at that time and detector.
My first thought is to use matplotlib in conjunction with pandas and numpy.
I've been trying lots of different approaches and feel like I've hit a brick wall in terms of getting it working.
Does anyone have a good idea about using these tools?
Cheers!
Row Labels 14142OB_L1 14142OB_L2 14140OB_E1P0 14140OB_E1P1 14140OB_E2P0 14140OB_E2P1 14140OB_L1 14140OB_L2 14140OB_M1P0 14140OB_M1P1 14140OB_M2P0 14140OB_M2P1 14140OB_M3P0 14140OB_M3P1 14140OB_S1P0 14140OB_S1P1 14140OB_S2P0 14140OB_S2P1 14140OB_S3P0 14140OB_S3P1 14138OB_L1 14138OB_L2 14138OB_L3 14136OB_L1 14136OB_L2 14136OB_L3 14134OB_L1 14134OB_L2 14134OB_L3 14132OB_L1 14132OB_L2 14132OB_L3
00 - 01 hr 0.22 1.42 0.29 0.29 0.59 0.59 0.17 1.47 0.38 0.38 0.56 0.6 0.08 0.1 0.67 0.7 0.88 0.9 0.15 0.17 0.17 1.66 0.47 0.16 1.6 0.49 0.14 0.94 1.21 0.21 1.22 0.44
01 - 02 hr 0.08 0.77 0.08 0.07 0.24 0.24 0.1 0.73 0.08 0.09 0.21 0.23 0.05 0.06 0.21 0.23 0.29 0.29 0.1 0.1 0.08 0.83 0.17 0.1 0.77 0.18 0.08 0.4 0.57 0.07 0.64 0.18
02 - 03 hr 0.08 0.73 0.06 0.06 0.23 0.23 0.06 0.73 0.07 0.07 0.23 0.24 0.02 0.02 0.16 0.17 0.32 0.34 0.06 0.07 0.06 0.77 0.16 0.06 0.78 0.17 0.07 0.3 0.66 0.06 0.68 0.19
03 - 04 hr 0.05 0.85 0.06 0.06 0.22 0.23 0.04 0.86 0.05 0.05 0.2 0.21 0.1 0.11 0.11 0.12 0.32 0.33 0.15 0.16 0.03 0.93 0.14 0.03 0.89 0.15 0.03 0.41 0.61 0.02 0.73 0.21
04 - 05 hr 0.13 1.25 0.09 0.09 0.24 0.24 0.12 1.25 0.11 0.11 0.2 0.21 0.08 0.09 0.19 0.2 0.32 0.34 0.15 0.15 0.1 1.33 0.18 0.11 1.35 0.19 0.11 0.52 1 0.07 1.08 0.29
05 - 06 hr 0.91 2.87 0.08 0.08 0.66 0.69 0.8 2.96 0.15 0.17 0.43 0.45 0.32 0.33 0.39 0.41 0.76 0.82 0.47 0.49 0.59 3.27 0.51 0.58 3.19 0.56 0.45 1.85 2.19 0.43 2.52 0.79
06 - 07 hr 3.92 5.44 1.29 1.14 4.03 4.12 3.19 6.03 1.66 1.69 3.26 3.44 1.84 1.93 13.03 14.97 13.81 19.23 4.69 5.59 3.03 6.72 3.01 2.78 6.81 3.02 1.52 4.22 7.13 2.54 5.94 2.88
07 - 08 hr 4.68 6.35 1.67 1.8 5.69 5.95 4.01 6.81 2.69 2.78 3.84 4.03 3.27 4.05 24.25 24.39 28.07 36.5 15.39 15.38 3.79 7.91 4.28 3.58 7.91 4.33 1.67 6.16 8.3 3.17 6.59 3.74
08 - 09 hr 5.21 6.31 2.51 2.82 7.46 7.72 4.53 6.65 9.03 8.98 13.94 12.77 6.73 8.55 47 48.38 50.08 48.32 22.83 21.91 4.29 8.27 5.04 4.15 8.27 5.16 2.44 6.24 9.17 3.26 6.81 4.16
09 - 10 hr 4.05 6.17 1.01 0.99 4.47 4.55 3.45 6.53 1.68 1.74 3.12 3.24 1.82 1.98 16.49 16.22 15.58 20.36 4.31 5.2 3.36 7.24 3.55 3.03 7.36 3.73 1.89 5.64 6.75 2.24 5.94 3.26
10 - 11 hr 3.62 6.64 1.14 1.15 4.11 4.18 3.23 6.87 1.79 1.87 3.03 3.13 1.72 1.89 15.02 18.75 17.25 22.61 3.06 3.24 3.06 7.69 3.23 2.87 7.49 3.56 2.06 4.99 7.05 2.26 6.2 3.07
11 - 12 hr 4.31 6.74 1.29 1.3 4.91 4.97 3.79 6.88 2.25 2.35 3.97 4.29 1.84 1.98 19.58 22.5 24.92 23.14 3.27 3.46 3.65 7.67 3.96 3.43 7.74 4 2.39 5.4 7.67 2.57 6.42 3.22
12 - 13 hr 4.53 6.9 1.4 1.39 5.81 5.9 3.96 7.18 2.69 2.86 4.94 5.28 2.15 2.29 24.46 28.34 36.59 31.06 5.4 5.39 3.95 7.98 4.54 3.7 8.03 4.69 2.36 5.99 8.29 3.01 6.61 3.37
13 - 14 hr 6.13 7.29 1.57 1.55 6.02 6.11 5.34 7.74 2.67 2.76 5.2 5.56 2.04 2.16 23.74 28.31 31.01 36.89 4.15 4.6 5.22 8.83 4.77 4.96 8.84 4.92 2.65 6.56 9.77 3.96 7.23 3.88
14 - 15 hr 8.72 8.22 2.93 3.06 8.58 8.9 8.94 9.57 17.69 17.2 18.99 23.58 2.37 3.69 38.81 53.33 49.93 45.42 5.69 4.3 8.13 10.04 5.45 7.03 9.94 5.51 3.59 7.41 12.4 5.92 8.04 4.4
15 - 16 hr 13.26 9.75 15.68 18.3 22.21 23.25 10.8 9.06 35.31 37.1 36.27 35.89 3.14 2.91 47.93 54.86 51.96 50.74 6.27 5.77 11.82 12.78 7.62 12.03 12.5 6.55 4.71 9.21 17.87 9.06 9.33 4.5
16 - 17 hr 18.25 14.92 4.95 4.63 9.68 10.2 20.14 16.68 21.38 21.39 23.92 28.11 1.75 1.86 48.15 47.31 46.65 50.4 3.46 3.31 21.52 16.97 7.37 18.47 14.84 7.51 6.88 15.52 27.8 11.17 9.35 5.34
17 - 18 hr 13.82 9.76 31.23 31.46 34.89 36.06 13.72 11.14 41.24 44.5 42 47.07 1.6 1.62 57.4 58.92 57.23 62.92 3.41 8.01 20.26 20.35 15.25 21.49 20.5 9.31 12.27 17.3 34.46 22.89 20.56 12.04
18 - 19 hr 7.51 5.81 50.48 49.94 45.97 46.43 8.65 5.95 49.26 48.28 51.04 46.46 2 3.04 56.08 56.39 54.95 59.06 3.18 6.47 13.44 13.73 25.79 17.67 21.52 19.26 6.35 11.52 22.13 11.31 10.4 5.42
19 - 20 hr 3.96 5.01 2.77 2.71 6.62 6.87 3.65 5.19 7.72 7.86 9.5 10.44 1.17 1.44 23.6 30.16 28.82 30.87 1.73 1.76 3.6 6.52 4.04 3.38 6.51 4.03 1.88 5.05 7.15 2.99 5.44 3.1
20 - 21 hr 2.16 3.72 1.75 1.74 3.96 4.02 2.03 3.72 2.62 2.73 4.32 4.54 0.76 0.79 18.41 23.69 30.91 31.05 1.31 1.26 2.1 4.76 2.97 1.93 4.75 2.97 1.43 3.43 4.9 1.73 3.9 2.27
21 - 22 hr 2.03 3.81 1.49 1.47 2.97 2.99 2 3.79 2.11 2.15 3.07 3.27 0.37 0.4 12.96 14.05 15.49 17.93 0.64 0.67 1.86 4.87 2.35 1.75 4.88 2.29 1.14 3.4 4.44 1.57 3.89 1.92
22 - 23 hr 1.33 3.2 1.21 1.22 2.46 2.5 1.21 3.23 1.75 1.79 2.36 2.48 0.35 0.38 6.19 9.26 10.48 12.16 0.57 0.58 1.28 3.85 2 1.23 3.84 1.96 0.82 2.74 3.55 1.12 3.29 1.73
23 - 24 hr 0.65 2.43 0.49 0.49 1.41 1.44 0.69 2.35 0.69 0.7 1.3 1.38 0.19 0.21 1.51 1.66 2.46 2.45 0.41 0.42 0.71 2.63 1.06 0.59 2.73 1.04 0.4 1.8 2.25 0.58 2.28 0.94
Grand Total 4.57 5.26 5.23 5.32 7.64 7.85 4.36 5.56 8.54 8.73 9.83 10.29 1.49 1.74 20.68 23.05 23.71 25.17 3.78 4.1 4.84 6.98 4.5 4.79 7.21 3.98 2.39 5.29 8.59 3.84 5.63 2.97
Here is the current script I'm using.
import pandas as pd
import matplotlib.pyplot as plt

# read the csv file ('r' before the path string handles special characters such as '\')
df = pd.read_csv(r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv')
#create time and detector name axis
time_axis = df.index
detector_axis = df.columns
plt.plot(df)
Using Seaborn
import seaborn as sns

df = pd.read_csv(r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv')
#create time and detector name axis
sns.heatmap(df)
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-79-33a3388e21cc> in <module>()
6 #create time and detector name axis
7
----> 8 sns.heatmap(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in heatmap(data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, linewidths, linecolor, cbar, cbar_kws, cbar_ax, square, xticklabels, yticklabels, mask, ax, **kwargs)
515 plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
516 annot_kws, cbar, cbar_kws, xticklabels,
--> 517 yticklabels, mask)
518
519 # Add the pcolormesh kwargs here
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in __init__(self, data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, cbar, cbar_kws, xticklabels, yticklabels, mask)
166 # Determine good default values for the colormapping
167 self._determine_cmap_params(plot_data, vmin, vmax,
--> 168 cmap, center, robust)
169
170 # Sort out the annotations
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in _determine_cmap_params(self, plot_data, vmin, vmax, cmap, center, robust)
203 cmap, center, robust):
204 """Use some heuristics to set good defaults for colorbar and range."""
--> 205 calc_data = plot_data.data[~np.isnan(plot_data.data)]
206 if vmin is None:
207 vmin = np.percentile(calc_data, 2) if robust else calc_data.min()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
You can use .set_index('Row Labels') to ensure your Row Labels column is interpreted as an axis for the heatmap, and transpose your DataFrame with .T so that you get the time along the x-axis and the detectors on the y-axis.
sns.heatmap(df.set_index('Row Labels').T)
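A tiny runnable sketch of why this works (stand-in detector columns, not the real CSV): once 'Row Labels' is the index, every remaining column is numeric, so seaborn's internal np.isnan call no longer raises the TypeError above.

```python
import pandas as pd

# Stand-in for the occupancy table: one text column plus numeric columns.
df = pd.DataFrame({
    "Row Labels": ["00 - 01 hr", "01 - 02 hr"],
    "14142OB_L1": [0.22, 0.08],
    "14142OB_L2": [1.42, 0.77],
})

# Move the text column into the index and transpose so detectors
# become the rows (y-axis) and hours the columns (x-axis).
m = df.set_index("Row Labels").T
print(m.shape, [dt.kind for dt in m.dtypes])
```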

Can't index by timestamp in pandas dataframe

I took an excel sheet which has dates and some values and want to convert them to pandas dataframe and select only rows which are between certain dates.
For some reason I cannot select a row by date index
Raw Data in Excel file
MCU
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12
12-Feb-15 25.17 5.88 5.92 5.98 6.18 6.23 6.33
11-Feb-15 25.9 6.05 6.09 6.15 6.28 6.31 6.39
10-Feb-15 26.38 5.94 6.05 6.15 6.33 6.39 6.46
Code
xls = pd.ExcelFile('e:/Data.xlsx')
vols = xls.parse(asset.upper()+'VOL',header=1)
vols.set_index('Timestamp',inplace=True)
Data before set_index
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 \
0 2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08
1 2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17
2 2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16
Data after set_index
50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 25P3 \
Timestamp
2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08 3.21
2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17 3.32
2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16 3.31
Output
>>> vols.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12, ..., NaT]
Length: 1478, Freq: None, Timezone: None
>>> vols[date(2015,2,12)]
*** KeyError: datetime.date(2015, 2, 12)
I would expect this not to fail, and also I should be able to select a range of dates. Tried so many combinations but not getting it.
Using a datetime.date instance to retrieve the index won't work; you just need a string representation of the date, e.g. '2015-02-12' or '2015/02/12'.
Secondly, vols[date(2015,2,12)] is actually looking in your DataFrame's column headings, not the index. You can use loc to fetch rows by index label instead, e.g. vols.loc['2015-02-12'].
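A small runnable sketch of both lookups, with the frame reconstructed from the question's sample data (string labels and string slices both go through .loc):

```python
import pandas as pd

# Rebuild a frame shaped like the question's, with a DatetimeIndex.
vols = pd.DataFrame(
    {"50D": [25.17, 25.90, 26.38], "10P1": [5.88, 6.05, 5.94]},
    index=pd.to_datetime(["2015-02-12", "2015-02-11", "2015-02-10"]),
)
vols.index.name = "Timestamp"

# A string label goes through .loc, which looks at the index, not columns.
row = vols.loc["2015-02-12"]

# Date ranges work as string slices; sort the index first so slicing
# on a DatetimeIndex is well defined.
window = vols.sort_index().loc["2015-02-10":"2015-02-11"]
print(row["50D"], len(window))
```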

pandas dataframe plotting 1 column over 2

This is driving me nuts: I can't plot column 'b'.
It plots only column 'A'.
This is my code; no idea what I'm doing wrong, probably something silly.
The dataframe seems OK. The weirdness is that I can access both df['A'] and df['b'], but only df['A'].plot() works; if I issue df['b'].plot() I get this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    df['b'].plot()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2511, in plot_series
    **kwds)
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2317, in _plot
    plot_obj.generate()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 921, in generate
    self._compute_plot_data()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 997, in _compute_plot_data
    'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'Series': no numeric data to plot
import sqlalchemy
import pandas as pd
import matplotlib.pyplot as plt

engine = sqlalchemy.create_engine(
    'sqlite:///C:/Users/toto/PycharmProjects/my_db.sqlite')
tables = engine.table_names()
dic = {}
for t in tables:
    sql = 'SELECT t."weight" FROM "' + t + '" t WHERE t."udl"="IBE SM"'
    dic[t] = (pd.read_sql(sql, engine)['weight'][0],
              pd.read_sql(sql, engine)['weight'][1])
df = pd.DataFrame.from_dict(dic, orient='index').sort_index()
df = df.set_index(pd.DatetimeIndex(df.index))
df.columns = ['A', 'b']
print(df)
print(df.info())
df.plot()
plt.show()
This is the output of the two prints:
A b
2014-08-05 1.81 3.39
2014-08-06 1.81 3.39
2014-08-07 1.81 3.39
2014-08-08 1.80 3.37
2014-08-11 1.79 3.35
2014-08-13 1.80 3.36
2014-08-14 1.80 3.35
2014-08-18 1.80 3.35
2014-08-19 1.79 3.34
2014-08-20 1.80 3.35
2014-08-27 1.79 3.35
2014-08-28 1.80 3.35
2014-08-29 1.79 3.35
2014-09-01 1.79 3.35
2014-09-02 1.79 3.35
2014-09-03 1.79 3.36
2014-09-04 1.79 3.37
2014-09-05 1.80 3.38
2014-09-08 1.79 3.36
2014-09-09 1.79 3.35
2014-09-10 1.78 3.35
2014-09-11 1.78 3.34
2014-09-12 1.78 3.34
2014-09-15 1.78 3.35
2014-09-16 1.78 3.35
2014-09-17 1.78 3.35
2014-09-18 1.78 3.34
2014-09-19 1.79 3.35
2014-09-22 1.79 3.36
2014-09-23 1.80 3.37
... ... ...
2014-12-10 1.73 3.29
2014-12-11 1.74 3.27
2014-12-12 1.74 3.25
2014-12-15 1.74 3.24
2014-12-16 1.74 3.27
2014-12-17 1.75 3.28
2014-12-18 1.76 3.29
2014-12-19 1.04 1.39
2014-12-22 1.04 1.39
2014-12-23 1.04 1.4
2014-12-24 1.04 1.39
2014-12-29 1.04 1.39
2014-12-30 1.04 1.4
2015-01-02 1.04 1.4
2015-01-05 1.04 1.4
2015-01-06 1.04 1.4
2015-01-07 NaN 1.39
2015-01-08 NaN 1.39
2015-01-09 NaN 1.39
2015-01-12 NaN 1.38
2015-01-13 NaN 1.38
2015-01-14 NaN 1.38
2015-01-15 NaN 1.38
2015-01-16 NaN 1.38
2015-01-19 NaN 1.39
2015-01-20 NaN 1.38
2015-01-21 NaN 1.39
2015-01-22 NaN 1.4
2015-01-23 NaN 1,4
2015-01-26 NaN 1.41
[107 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 107 entries, 2014-08-05 00:00:00 to 2015-01-26 00:00:00
Data columns (total 2 columns):
A 93 non-null float64
b 107 non-null object
dtypes: float64(1), object(1)
memory usage: 2.1+ KB
None
Process finished with exit code 0
Just got it: 'b' is of object type and not float64 because of this line:
2015-01-23 NaN 1,4
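As a sketch of the fix (using a hypothetical two-row frame reproducing the problem), the comma decimal can be normalised before plotting:

```python
import pandas as pd

# Hypothetical frame reproducing the problem: one value uses a comma decimal,
# which forces the whole column to object dtype.
df = pd.DataFrame({"A": [1.04, None], "b": ["1.4", "1,4"]})

# Replace the comma with a dot and coerce to float; the column is then
# float64 and df['b'].plot() has numeric data to work with.
df["b"] = pd.to_numeric(df["b"].str.replace(",", ".", regex=False))
print(df["b"].dtype)
```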
