I have been trying to make a Bokeh line chart, but I am running into a problem using a column of timestamps from my pandas data frame as the x-axis. Currently my data frame looks like this:
TMAX TMIN TAVG DAY NUM
2007-04-30 65 46 55.5 2007-04-30 1
2007-05-01 75 45 60.0 2007-05-01 2
2007-05-02 66 52 59.0 2007-05-02 3
2007-05-03 65 43 54.0 2007-05-03 4
2007-05-04 61 45 53.0 2007-05-04 5
2007-05-05 65 43 54.0 2007-05-05 6
2007-05-06 77 51 64.0 2007-05-06 7
2007-05-07 89 66 77.5 2007-05-07 8
2007-05-08 91 56 73.5 2007-05-08 9
2007-05-09 83 48 65.5 2007-05-09 10
2007-05-10 68 47 57.5 2007-05-10 11
2007-05-11 65 46 55.5 2007-05-11 12
2007-05-12 63 43 53.0 2007-05-12 13
2007-05-13 65 46 55.5 2007-05-13 14
2007-05-14 71 46 58.5 2007-05-14 15
....
[3592 rows x 5 columns]
I want to index the line plot with the values of the "DAY" column, however, I get an error no matter the approach I take. The documentation for line plots says that "x (str or list(str), optional) – specifies variable(s) to use for x axis". My code is as follows:
xyvalues = np.array([df['TAVG'], df_reg['ry'], df['DAY']])
regr = Line(data=xyvalues, x='DAY', title="Linear Regression of Data", ylabel="Average Daily Temperature", xlabel="Number of Days")
output_file("regression.html")
show(regr)
This gives me the error "TypeError: Cannot compare type 'Timestamp' with type 'float64'". I have tried converting it to float, but it doesn't seem to have an effect. Any help would be much appreciated. The df_reg['ry'] is data from a linear regression data frame.
Documentation for line graphs can be found here: http://docs.bokeh.org/en/latest/docs/reference/charts.html#line
Inside Line, you need to pass a pandas data frame to the data argument in order to be able to refer to your variable DAY for the x axis ticks. Here I create a new pandas DataFrame from the other two:
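import pandas as pd
# Legacy bokeh.charts API, as used in the question
from bokeh.charts import Line, output_file, show
# Build one DataFrame so that columns can be referenced by name
df2 = pd.DataFrame(data=dict(TAVG=df['TAVG'], ry=df_reg['ry'], DAY=df['DAY']))
regr = Line(data=df2, x='DAY',
            title="Linear Regression of Data",
            ylabel="Average Daily Temperature",
            xlabel="Number of Days")
output_file("regression.html")
show(regr)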
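To see why the container type matters, here is a minimal pandas-only sketch. The three sample rows and the ry values are made up for illustration and are not real regression output. A NumPy array has no column labels, so a name such as 'DAY' cannot be looked up in it, whereas a DataFrame keeps both the labels and a separate dtype for each column.

```python
import pandas as pd

# Small stand-in for the question's data (values are illustrative only)
df = pd.DataFrame({
    "TAVG": [55.5, 60.0, 59.0],
    "DAY": pd.to_datetime(["2007-04-30", "2007-05-01", "2007-05-02"]),
})
ry = pd.Series([56.0, 58.0, 60.0])  # hypothetical regression values

# One DataFrame: every column keeps its name and its own dtype
df2 = pd.DataFrame({"TAVG": df["TAVG"], "ry": ry, "DAY": df["DAY"]})

print(list(df2.columns))
print(df2.dtypes)
```

Because DAY is still a datetime column in df2, a plotting call that accepts a column name for x can treat it as a time axis instead of comparing timestamps with floats.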
I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so no need to make an attempt to join the data back together after achieving the prediction! Thanks!
Use rolling with a window of 3 and min_periods of 1, then shift the result by one row so that each prediction only uses earlier days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
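The question also asks about arbitrary functions. One way to sketch that is with Rolling.apply, which calls a function on each window. The predict function below is a hypothetical example (a median), not something from the question. With raw=True it receives a plain NumPy array of up to three values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "temp": [70, 72, 68, 67, 70]})

def predict(window_values):
    # Hypothetical prediction rule: median of the values in the window
    return np.median(window_values)

prediction = (
    df["temp"]
    .rolling(window=3, min_periods=1)
    .apply(predict, raw=True)   # call predict once per window
    .shift()                    # row i only sees rows before i
)
print(prediction)
```

Rolling.apply runs a Python function once per window, so it is slower than built-in reductions such as mean. For large frames it is worth checking whether the function can be expressed with the built-in rolling methods first.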
I'm working on a project involving railway tracks, and I'm trying to find an algorithm that can detect curves (left/right) or straight lines based on time-series GPS coordinates.
The data contains latitude, longitude, and altitude values along with many different sensor readings of a vehicle in a specific range of time.
Example dataframe of a curve looks as follows:
latitude longitude altitude
1 43.46724 -5.823470 145.0
2 43.46726 -5.823653 145.2
3 43.46728 -5.823837 145.4
4 43.46730 -5.824022 145.6
5 43.46730 -5.824022 145.6
6 43.46734 -5.824394 146.0
7 43.46738 -5.824768 146.3
8 43.46738 -5.824768 146.3
9 43.46742 -5.825146 146.7
10 43.46742 -5.825146 146.7
11 43.46746 -5.825527 147.1
12 43.46746 -5.825527 147.1
13 43.46750 -5.825910 147.3
14 43.46751 -5.826103 147.4
15 43.46753 -5.826295 147.6
16 43.46753 -5.826489 147.8
17 43.46753 -5.826685 148.1
18 43.46753 -5.826878 148.2
19 43.46752 -5.827073 148.4
20 43.46750 -5.827266 148.6
21 43.46748 -5.827458 148.9
22 43.46744 -5.827650 149.2
23 43.46741 -5.827839 149.5
24 43.46736 -5.828029 149.7
25 43.46731 -5.828212 150.1
26 43.46726 -5.828393 150.4
27 43.46720 -5.828572 150.5
28 43.46713 -5.828746 150.8
29 43.46706 -5.828914 151.0
30 43.46698 -5.829078 151.2
31 43.46690 -5.829237 151.4
32 43.46681 -5.829392 151.6
33 43.46671 -5.829540 151.8
34 43.46661 -5.829680 152.0
35 43.46650 -5.829816 152.2
36 43.46639 -5.829945 152.4
37 43.46628 -5.830066 152.4
38 43.46616 -5.830180 152.4
39 43.46604 -5.830287 152.5
40 43.46591 -5.830384 152.6
41 43.46579 -5.830472 152.8
42 43.46566 -5.830552 152.9
43 43.46552 -5.830623 153.2
44 43.46539 -5.830687 153.4
45 43.46526 -5.830745 153.6
46 43.46512 -5.830795 153.8
47 43.46499 -5.830838 153.9
48 43.46485 -5.830871 153.9
49 43.46471 -5.830895 154.0
50 43.46458 -5.830911 154.2
51 43.46445 -5.830919 154.3
52 43.46432 -5.830914 154.7
53 43.46418 -5.830896 155.1
54 43.46406 -5.830874 155.6
55 43.46393 -5.830842 155.9
56 43.46381 -5.830803 156.0
57 43.46368 -5.830755 155.5
58 43.46356 -5.830700 155.3
59 43.46332 -5.830575 155.1
I've found out about spline interpolation on this old post asking the same question and decided to apply it in my problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
## read csv file with pandas
df = pd.read_csv("Curvas/Curva_2.csv")
# forward-fill missing values in the latitude and longitude columns
df['latitude'] = df['latitude'].ffill()
df['longitude'] = df['longitude'].ffill()
# plot the data
# df.plot(x='longitude', y='latitude', style='o')
# plt.show()
# using longitude and latitude data, use spline interpolation to create a new curve
x = df['longitude']
y = df['latitude']
# evenly spaced longitudes spanning the observed range
xnew = np.linspace(x.min(), x.max(), x.shape[0])
ynew = make_interp_spline(xnew, y)(x)
plt.plot(xnew, ynew, zorder=2)
plt.show()
## Error results using different coordinates/routes
## Curve_1 → Left (e = 0.04818886515888465)
## Curve_2 → Left (e = 0.019459215874292113)
## Straight_1 → Straight (e = 0.03839597167971931)
I've calculated the error between the interpolated points and the real ones but I'm not quite sure how to proceed next or what threshold to use to figure out the direction.
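As a possible alternative to thresholding the spline error, here is a rough sketch of a different technique: accumulate the change in heading along the track and look at its sign. This is not the spline approach described above. The function name turn_direction and the 5 degree threshold are assumptions chosen for illustration, and real GPS noise would probably call for smoothing first.

```python
import numpy as np

def turn_direction(lat, lon, threshold_deg=5.0):
    """Classify a track as 'left', 'right' or 'straight' (hypothetical helper)."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Local flat approximation: x points east, y points north
    x = (lon - lon[0]) * np.cos(np.radians(lat.mean()))
    y = lat - lat[0]
    dx, dy = np.diff(x), np.diff(y)
    # Drop repeated fixes, which have no defined heading
    moving = (dx != 0) | (dy != 0)
    dx, dy = dx[moving], dy[moving]
    # Heading of each step, unwrapped so it does not jump at +/-180 degrees
    heading = np.unwrap(np.arctan2(dy, dx))
    total = np.degrees(heading[-1] - heading[0])
    if total > threshold_deg:
        return "left"    # counterclockwise change in heading
    if total < -threshold_deg:
        return "right"   # clockwise change in heading
    return "straight"

# Synthetic tracks: east then north, east then south, and due east
print(turn_direction([0, 0, 0, 1, 2], [0, 1, 2, 2, 2]))
print(turn_direction([0, 0, 0, -1, -2], [0, 1, 2, 2, 2]))
print(turn_direction([0, 0, 0, 0], [0, 1, 2, 3]))
```

The threshold would have to be tuned on labeled routes such as Curve_1, Curve_2 and Straight_1. The sign convention relies on x pointing east and y pointing north, so that a counterclockwise change corresponds to a left turn.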
As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, in order to get a normal dataframe back, I can pass
as_index = False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
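For reference, a self-contained version of this approach, using the ten sample rows shown in the question. It assumes Date is stored as text and therefore converts it with pd.to_datetime first, since the .dt accessor only works on datetime-like columns.

```python
import pandas as pd

# The ten rows visible in the question
df = pd.DataFrame({
    "Date": ["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06", "2006-01-09",
             "2006-12-23", "2006-12-24", "2006-12-26", "2006-12-27", "2006-12-30"],
    "Value": [18, 12, 11, 10, 22, 47, 46, 35, 35, 28],
    "Station_Name": [2] * 5 + [45] * 5,
})
df["Date"] = pd.to_datetime(df["Date"])

monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
    .mean()
    .reset_index()   # turn the two group keys back into ordinary columns
)
print(monthly)
```

Grouping by a monthly Period instead of pd.Grouper(freq='M') labels each group as 2006-01 rather than with the month-end date, and only months that actually have observations for a station appear in the result.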
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, after applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, median and max of the row to some coordinates. This is done each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
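for index, row in df_input.iterrows():
    # row-level statistics (named so they don't shadow the built-ins min and max)
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output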
help!
My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of multiindex columns etc. Any pointers would be great.
thanks so much... merry xmas to you all.
Consider calculating the row aggregates with DataFrame methods and then passing those Series into DataFrame.apply() across columns. This assumes foo is written with vectorized operations, so that it accepts whole Series:
cols = list('ABCDEFGHIJ')
# ROW-WISE AGGREGATES (restricted to the data columns so the helper columns are not included)
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
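A runnable sketch of the same idea on the df_input from the question. The foo below is a hypothetical stand-in (min-max scaling) rather than the OLS mapping described in the question, and it keeps the row statistics in separate Series instead of adding helper columns.

```python
import pandas as pd

cols = list("ABCDEFGHIJ")
df_input = pd.DataFrame(
    [[60, 48, 26, 29, 41, 91, 93, 87, 39, 65],
     [88, 52, 24, 99, 1, 27, 12, 26, 64, 87],
     [13, 1, 38, 60, 8, 50, 59, 1, 3, 76]],
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    columns=cols,
)

def foo(x, row_min, row_max, row_median):
    # Hypothetical stand-in for the real formula: scale each value to [0, 1].
    # row_median is accepted only to mirror the signature in the question.
    return (x - row_min) / (row_max - row_min)

# Row-wise statistics, one value per date
row_min = df_input.min(axis=1)
row_max = df_input.max(axis=1)
row_median = df_input.median(axis=1)

# Each column is a Series on the same date index, so the arithmetic aligns by row
df_output = df_input.apply(lambda col: foo(col, row_min, row_max, row_median))
print(df_output.round(2))
```

This only works when foo uses operations that accept whole Series. A function that needs Python-level branching on each value would have to fall back to a row-wise apply.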
I am trying to build a function that uses .shift() but it is giving me an error.
Consider this:
In [40]:
data={'level1':[20,19,20,21,25,29,30,31,30,29,31],
'level2': [10,10,20,20,20,10,10,20,20,10,10]}
index= pd.date_range('12/1/2014', periods=11)
frame=DataFrame(data, index=index)
frame
Out[40]:
level1 level2
2014-12-01 20 10
2014-12-02 19 10
2014-12-03 20 20
2014-12-04 21 20
2014-12-05 25 20
2014-12-06 29 10
2014-12-07 30 10
2014-12-08 31 20
2014-12-09 30 20
2014-12-10 29 10
2014-12-11 31 10
A normal function works fine. To demonstrate I calculate the same result twice, using the direct and function approach:
In [63]:
frame['horizontaladd1']=frame['level1']+frame['level2']#works
def horizontaladd(x):
test=x['level1']+x['level2']
return test
frame['horizontaladd2']=frame.apply(horizontaladd, axis=1)
frame
Out[63]:
level1 level2 horizontaladd1 horizontaladd2
2014-12-01 20 10 30 30
2014-12-02 19 10 29 29
2014-12-03 20 20 40 40
2014-12-04 21 20 41 41
2014-12-05 25 20 45 45
2014-12-06 29 10 39 39
2014-12-07 30 10 40 40
2014-12-08 31 20 51 51
2014-12-09 30 20 50 50
2014-12-10 29 10 39 39
2014-12-11 31 10 41 41
But while directly applying shift works, in a function it doesn't work:
frame['verticaladd1']=frame['level1']+frame['level1'].shift(1)#works
def verticaladd(x):
test=x['level1']+x['level1'].shift(1)
return test
frame.apply(verticaladd)#error
results in
KeyError: ('level1', u'occurred at index level1')
I also tried applying to a single column which makes more sense in my mind, but no luck:
def verticaladd2(x):
test=x-x.shift(1)
return test
frame['level1'].map(verticaladd2)#error, also with apply
error:
AttributeError: 'numpy.int64' object has no attribute 'shift'
Why not call shift directly? I need to embed it into a function to calculate multiple columns at the same time, along axis 1. See related question Ambiguous truth value with boolean logic
Try passing the frame to the function, rather than using apply (I am not sure why apply doesn't work, even column-wise):
def f(x):
    return x.level1 + x.level1.shift(1)
f(frame)
returns:
2014-12-01 NaN
2014-12-02 39
2014-12-03 39
2014-12-04 41
2014-12-05 46
2014-12-06 54
2014-12-07 59
2014-12-08 61
2014-12-09 61
2014-12-10 59
2014-12-11 60
Freq: D, Name: level1, dtype: float64
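On why the apply calls in the question fail: with the default axis=0, frame.apply passes each column to the function as a Series indexed by date, so x['level1'] looks for a row labelled 'level1' and raises a KeyError. With map, or with apply on a single column, the function receives one scalar at a time, and scalars have no shift method. The sketch below uses the first five rows of the question's data. The level2 variant of verticaladd is an illustrative assumption, not something from the question.

```python
import pandas as pd

frame = pd.DataFrame(
    {"level1": [20, 19, 20, 21, 25], "level2": [10, 10, 20, 20, 20]},
    index=pd.date_range("2014-12-01", periods=5),
)

# Column-wise apply: the argument is a whole column, so shift is available on it
col_sums = frame.apply(lambda col: col + col.shift(1))

# Multi-column calculation: call the function on the frame itself
def verticaladd(df):
    return df["level1"] + df["level2"].shift(1)

combined = verticaladd(frame)
print(col_sums)
print(combined)
```

Calling the function on the whole frame keeps every column available inside it, which is what a calculation spanning several columns needs.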
Check whether the values you are trying to shift are a NumPy array rather than a pandas Series. If they are, convert the array to a Series first; after that you will be able to call shift on it. I had the same issue, and this is how I got the shifted values.
This is my part of the code for your reference.
X = grouped['Confirmed_day'].values
X_series=pd.Series(X)
X_lag1 = X_series.shift(1)
I'm not entirely following along, but if frame['level1'].shift(1) works, then I can only imagine that frame['level1'] is not a numpy.int64 object while whatever you are passing into the verticaladd function is. Probably need to look at your types.