I'm working on a project involving railway tracks and I'm trying to find an algorithm that can detect curves (left/right) or straight sections based on time-series GPS coordinates.
The data contains latitude, longitude, and altitude values, along with many other sensor readings from a vehicle over a given time range.
Example dataframe of a curve looks as follows:
latitude longitude altitude
1 43.46724 -5.823470 145.0
2 43.46726 -5.823653 145.2
3 43.46728 -5.823837 145.4
4 43.46730 -5.824022 145.6
5 43.46730 -5.824022 145.6
6 43.46734 -5.824394 146.0
7 43.46738 -5.824768 146.3
8 43.46738 -5.824768 146.3
9 43.46742 -5.825146 146.7
10 43.46742 -5.825146 146.7
11 43.46746 -5.825527 147.1
12 43.46746 -5.825527 147.1
13 43.46750 -5.825910 147.3
14 43.46751 -5.826103 147.4
15 43.46753 -5.826295 147.6
16 43.46753 -5.826489 147.8
17 43.46753 -5.826685 148.1
18 43.46753 -5.826878 148.2
19 43.46752 -5.827073 148.4
20 43.46750 -5.827266 148.6
21 43.46748 -5.827458 148.9
22 43.46744 -5.827650 149.2
23 43.46741 -5.827839 149.5
24 43.46736 -5.828029 149.7
25 43.46731 -5.828212 150.1
26 43.46726 -5.828393 150.4
27 43.46720 -5.828572 150.5
28 43.46713 -5.828746 150.8
29 43.46706 -5.828914 151.0
30 43.46698 -5.829078 151.2
31 43.46690 -5.829237 151.4
32 43.46681 -5.829392 151.6
33 43.46671 -5.829540 151.8
34 43.46661 -5.829680 152.0
35 43.46650 -5.829816 152.2
36 43.46639 -5.829945 152.4
37 43.46628 -5.830066 152.4
38 43.46616 -5.830180 152.4
39 43.46604 -5.830287 152.5
40 43.46591 -5.830384 152.6
41 43.46579 -5.830472 152.8
42 43.46566 -5.830552 152.9
43 43.46552 -5.830623 153.2
44 43.46539 -5.830687 153.4
45 43.46526 -5.830745 153.6
46 43.46512 -5.830795 153.8
47 43.46499 -5.830838 153.9
48 43.46485 -5.830871 153.9
49 43.46471 -5.830895 154.0
50 43.46458 -5.830911 154.2
51 43.46445 -5.830919 154.3
52 43.46432 -5.830914 154.7
53 43.46418 -5.830896 155.1
54 43.46406 -5.830874 155.6
55 43.46393 -5.830842 155.9
56 43.46381 -5.830803 156.0
57 43.46368 -5.830755 155.5
58 43.46356 -5.830700 155.3
59 43.46332 -5.830575 155.1
I found out about spline interpolation from an old post asking the same question and decided to apply it to my problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

# read csv file with pandas
df = pd.read_csv("Curvas/Curva_2.csv")
# forward-fill gaps in the latitude and longitude columns
df['latitude'].fillna(method='ffill', inplace=True)
df['longitude'].fillna(method='ffill', inplace=True)
# plot the raw data
# df.plot(x='longitude', y='latitude', style='o')
# plt.show()
# using longitude and latitude data, use spline interpolation to create a new curve
x = df['longitude']
y = df['latitude']
xnew = np.linspace(x.min(), x.max(), x.shape[0])
# make_interp_spline takes the sample points first and is then evaluated at the
# new points; it also requires x to be strictly increasing, so sort if necessary
ynew = make_interp_spline(x, y)(xnew)
plt.plot(xnew, ynew, zorder=2)
plt.show()
## Error results using different coordinates/routes
## Curve_1 → Left (e = 0.04818886515888465)
## Curve_2 → Left (e = 0.019459215874292113)
## Straight_1 → Straight (e = 0.03839597167971931)
I've calculated the error between the interpolated points and the real ones, but I'm not sure how to proceed next or what threshold to use to determine the direction.
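One way to proceed without interpolation at all (a sketch, not from the original post): compute the signed turn angle between consecutive displacement vectors and accumulate it. With x = longitude (east) and y = latitude (north), a positive total means a left turn, a negative total a right turn, and a small total a straight section. The `threshold_deg` value and the function name are assumptions to tune on your data:

```python
import numpy as np
import pandas as pd

def curve_direction(df, threshold_deg=10.0):
    """Classify a GPS trace as 'Left', 'Right' or 'Straight' from the
    cumulative signed heading change (threshold_deg is a tunable assumption)."""
    # drop repeated fixes so stationary points don't produce zero-length vectors
    pts = df[['longitude', 'latitude']].drop_duplicates().to_numpy()
    v = np.diff(pts, axis=0)  # displacement vectors between consecutive fixes
    # signed turn angle between consecutive segments via atan2(cross, dot);
    # positive cross product = counterclockwise = left turn in an x-east, y-north frame
    cross = v[:-1, 0] * v[1:, 1] - v[:-1, 1] * v[1:, 0]
    dot = (v[:-1] * v[1:]).sum(axis=1)
    total_turn = np.degrees(np.arctan2(cross, dot)).sum()
    if total_turn > threshold_deg:
        return 'Left'
    if total_turn < -threshold_deg:
        return 'Right'
    return 'Straight'
```

For short track segments the lat/lon flat-plane approximation is usually adequate; for long ones you would want to convert to a local projected frame first.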
Related
I have a dataset that consists of 5 rows of points that form curves. I want to separate the inner row from the others (or, if possible, each row) and store them in separate arrays. Is there any way to do this, for example by somehow flattening the curved data and then sorting it based on the x and y values?
I would like to assign each row, from left to right, numbers from 0 to the max of the row. Right now the labels of the dots are not useful to me and I can't change them.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves have a pattern, so you could select the curve of interest using slicing. I had to offset the selection slightly to get the five curves, because the first 8 points are not in the same order as the rest of the data. So the initial 8 data points are discarded, but they could be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve
Let's say I have an Excel document with the following format. I'm reading said Excel doc with pandas and plotting data using matplotlib and numpy. Everything is great!
But... I want more constraints. Now I want to constrain my data so that I can select only specific zenith and azimuth angles. More specifically: I only want zenith when it is between 30 and 90, and I only want azimuth when it is between 30 and 330.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
0 81 10
70 35 7
110 90 17
270 45 23
330 45 13
345 47 6
175 82 7
220 7 8
This is an example of the sort of constraint I'm looking for.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
70 35 7
110 90 17
270 45 23
330 45 13
175 82 7
The following is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

P_file = file1
out_file = file2
out_file2 = file3

data = pd.read_csv(file1, header=None, sep=' ')
df = pd.DataFrame(data=data)
# 19 headers; the ones that matter for this question are
# 'DateTime', 'Zenith', 'Azimuth', and 'Ozone Amount'
df.to_csv(file2, sep=',', header=[...])
df = pd.read_csv(file2, header='infer')

# keep only rows containing the locator for the given day
mask = df[df['DateTime'].str.contains('20141201')]
mask.to_csv(file2)  # update file2 so it only has the data I want

data2 = pd.read_csv(file2, header='infer')
df2 = pd.DataFrame(data=data2)

def tojuliandate(date):
    ...  # converts dates of format %Y%m%dT%H%M%SZ to julian date format %y%j

def timeofday(date):
    ...  # converts %Y%m%dT%H%M%SZ to %H%M%S for more narrow views of the data

df2['Time of Day'] = df2['DateTime'].apply(timeofday)
df2.to_csv(file2)  # adds a column for time of day to the file
So basically, at this point, this is all the code that goes into making the CSV I want to filter. How would I go about filtering 'Zenith' and 'Azimuth' so that they meet the criteria I specified above? I know that I will need some kind of conditional selection to do this. I tried something like this, but it didn't work and I was looking for a bit of help:
df[(df["Zenith"]>30) & (df["Zenith"]<90) & (df["Azimuth"]>30) & (df["Azimuth"]<330)]
Basically a duplicate of Efficient way to apply multiple filters to pandas DataFrame or Series
You can use series between:
df[(df['Zenith'].between(30, 90)) & (df['Azimuth'].between(30, 330))]
Yields:
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
Note that by default, these upper and lower bounds are inclusive (inclusive=True in older pandas, inclusive='both' in current versions).
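A small illustration with made-up values (in recent pandas the inclusive argument takes a string such as 'neither'; older versions used a boolean):

```python
import pandas as pd

s = pd.Series([30, 50, 90])
inclusive = s.between(30, 90)                        # bounds included by default
strict = s.between(30, 90, inclusive='neither')      # strict inequalities
print(inclusive.tolist())  # [True, True, True]
print(strict.tolist())     # [False, True, False]
```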
You can write only those entries of the dataframe to your file which meet your boundary conditions (note that the conditions must be combined with &, not the Python and operator):
# replace the line df.to_csv(...) in your example with
df[(df['Zenith'] >= 30) & (df['Zenith'] <= 90) &
   (df['Azimuth'] >= 30) & (df['Azimuth'] <= 330)].to_csv('my_csv.csv')
Using pd.DataFrame.query:
df_new = df.query('30 <= Zenith <= 90 and 30 <= Azimuth <= 330')
print(df_new)
Azimuth Zenith OzoneAmount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
I have a table "tempcc" of values with x, y geographic coordinates (I don't know how to attach files here; there are 86 rows in my csv):
X Y Temp
0 35.268 55.618 1.065389
1 35.230 55.682 1.119160
2 35.508 55.690 1.026214
3 35.482 55.652 1.007834
4 35.289 55.664 1.087598
5 35.239 55.655 1.099459
6 35.345 55.662 1.066117
7 35.402 55.649 1.035958
8 35.506 55.643 0.991939
9 35.526 55.688 1.018137
10 35.541 55.695 1.017870
11 35.471 55.682 1.033929
12 35.573 55.668 0.985559
13 35.547 55.651 0.982335
14 35.425 55.671 1.042975
15 35.505 55.675 1.016236
16 35.600 55.681 0.985532
17 35.458 55.717 1.063691
18 35.538 55.720 1.037523
19 35.230 55.726 1.146047
20 35.606 55.707 1.003364
21 35.582 55.700 1.006711
22 35.350 55.696 1.087173
23 35.309 55.677 1.088988
24 35.563 55.687 1.003785
25 35.510 55.764 1.079220
26 35.334 55.736 1.119026
27 35.429 55.745 1.093300
28 35.366 55.752 1.119061
29 35.501 55.745 1.068676
.. ... ... ...
56 35.472 55.800 1.117183
57 35.538 55.855 1.134721
58 35.507 55.834 1.129712
59 35.256 55.845 1.211969
60 35.338 55.823 1.174397
61 35.404 55.835 1.162387
62 35.460 55.826 1.138965
63 35.497 55.831 1.130774
64 35.469 55.844 1.148516
65 35.371 55.510 0.945187
66 35.378 55.545 0.969400
67 35.456 55.502 0.902285
68 35.429 55.517 0.925932
69 35.367 55.710 1.090652
70 35.431 55.490 0.903296
71 35.284 55.606 1.051335
72 35.234 55.634 1.088135
73 35.284 55.591 1.041181
74 35.354 55.587 1.010446
75 35.332 55.581 1.015004
76 35.356 55.606 1.023234
77 35.311 55.545 0.997468
78 35.307 55.575 1.020845
79 35.363 55.645 1.047831
80 35.401 55.628 1.021373
81 35.340 55.629 1.045491
82 35.440 55.643 1.017227
83 35.293 55.630 1.063910
84 35.370 55.623 1.029797
85 35.238 55.601 1.065699
I try to create isolines with:
from mpl_toolkits.basemap import Basemap
import numpy as np
from numpy import meshgrid, linspace

data = tempcc
m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
x = linspace(m.llcrnrlon, m.urcrnrlon, data.shape[1])
y = linspace(m.llcrnrlat, m.urcrnrlat, data.shape[0])
xx, yy = meshgrid(x, y)
m.contour(xx, yy, data, latlon=True)
# plt.legend()
m.scatter(tempcc['X'].values, tempcc['Y'].values, latlon=True)
# m.contour(x, y, data, latlon=True)
But I can't get it to work correctly, although everything seems to be fine. As far as I understand, I have to build a 2D matrix of values where i is latitude and j is longitude, but I can't find an example.
The result I get: as you can see, the region is correct, but the interpolation is not good.
What's the matter? Which parameter have I forgotten?
You could use a Triangulation and then call tricontour() instead of contour():
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.tri import Triangulation
from mpl_toolkits.basemap import Basemap

m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
triMesh = Triangulation(tempcc['X'].values, tempcc['Y'].values)
tctr = m.tricontour(triMesh, tempcc['Temp'].values,
                    levels=np.linspace(tempcc['Temp'].values.min(),
                                       tempcc['Temp'].values.max(), 7),
                    latlon=True)
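Alternatively (a sketch, not from the original answer): you can resample the scattered points onto a regular grid with scipy.interpolate.griddata and then call the usual contour(). The helper name and grid sizes below are made up:

```python
import numpy as np
from scipy.interpolate import griddata

def grid_values(x, y, z, nx=50, ny=50):
    """Interpolate scattered (x, y, z) samples onto a regular nx-by-ny grid."""
    xi = np.linspace(x.min(), x.max(), nx)
    yi = np.linspace(y.min(), y.max(), ny)
    xx, yy = np.meshgrid(xi, yi)
    # linear interpolation; cells outside the convex hull of the data become NaN
    zz = griddata((x, y), z, (xx, yy), method='linear')
    return xx, yy, zz
```

You would then call m.contour(xx, yy, zz, latlon=True) with the gridded values, exactly the 2D matrix the question was asking about.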
I have been trying to make a bokeh line chart, however I am running into the issue of indexing the x-axis with a column of time stamps from my pandas data frame. Currently my data frame looks like this:
TMAX TMIN TAVG DAY NUM
2007-04-30 65 46 55.5 2007-04-30 1
2007-05-01 75 45 60.0 2007-05-01 2
2007-05-02 66 52 59.0 2007-05-02 3
2007-05-03 65 43 54.0 2007-05-03 4
2007-05-04 61 45 53.0 2007-05-04 5
2007-05-05 65 43 54.0 2007-05-05 6
2007-05-06 77 51 64.0 2007-05-06 7
2007-05-07 89 66 77.5 2007-05-07 8
2007-05-08 91 56 73.5 2007-05-08 9
2007-05-09 83 48 65.5 2007-05-09 10
2007-05-10 68 47 57.5 2007-05-10 11
2007-05-11 65 46 55.5 2007-05-11 12
2007-05-12 63 43 53.0 2007-05-12 13
2007-05-13 65 46 55.5 2007-05-13 14
2007-05-14 71 46 58.5 2007-05-14 15
....
[3592 rows x 5 columns]
I want to index the line plot with the values of the "DAY" column, however, I get an error no matter the approach I take. The documentation for line plots says that "x (str or list(str), optional) – specifies variable(s) to use for x axis". My code is as follows:
xyvalues = np.array([df['TAVG'], df_reg['ry'], df['DAY']])
regr = Line(data=xyvalues, x='DAY', title="Linear Regression of Data", ylabel="Average Daily Temperature", xlabel="Number of Days")
output_file("regression.html")
show(regr)
This gives me the error "TypeError: Cannot compare type 'Timestamp' with type 'float64'". I have tried converting it to float, but it doesn't seem to have an effect. Any help would be much appreciated. The df_reg['ry'] is data from a linear regression data frame.
Documentation for line graphs can be found here: http://docs.bokeh.org/en/latest/docs/reference/charts.html#line
Inside Line, you need to pass a pandas data frame as the data argument in order to be able to refer to your variable DAY for the x-axis ticks. Here I create a new pandas DataFrame from the other two:
from bokeh.charts import Line, output_file, show
import pandas as pd

df2 = pd.DataFrame(data=dict(TAVG=df['TAVG'], ry=df_reg['ry'], DAY=df['DAY']))
regr = Line(data=df2, x='DAY',
title="Linear Regression of Data",
ylabel="Average Daily Temperature",
xlabel="Number of Days")
output_file("regression.html")
show(regr)
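Note that bokeh.charts was deprecated and later removed from Bokeh. In current versions, the same plot can be built with bokeh.plotting; this is a sketch with hypothetical stand-in values for df and df_reg['ry']:

```python
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_file, save

# tiny stand-in for the question's dataframes (hypothetical values)
df2 = pd.DataFrame({
    'DAY': pd.to_datetime(['2007-04-30', '2007-05-01', '2007-05-02']),
    'TAVG': [55.5, 60.0, 59.0],
    'ry': [55.0, 57.0, 59.0],  # stand-in for df_reg['ry']
})

# a datetime x-axis handles Timestamp values directly, avoiding the
# "Cannot compare type 'Timestamp' with type 'float64'" error
p = figure(x_axis_type='datetime',
           title="Linear Regression of Data",
           x_axis_label="Day",
           y_axis_label="Average Daily Temperature")
p.line(df2['DAY'], df2['TAVG'], legend_label="TAVG")
p.line(df2['DAY'], df2['ry'], color='red', legend_label="regression")

output_file("regression.html")
save(p)  # use show(p) instead to open the plot in a browser
```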
I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values into ten-year intervals between the minimum and maximum values, get the cumulative frequencies for each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If need bins, one possible solution is with pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data and then plot with plot(kind='bar'):
import numpy as np

nBins = 10
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins)).sum()['Counts'].plot(kind='bar')
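Another option, not shown in the answers above: matplotlib's hist accepts a weights argument, so you can pass the ages as x values and the counts as weights and get the frequency-weighted histogram directly, with ten-year bin edges. The small patient_dets frame here is a stand-in built from the question's example values:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# hypothetical stand-in for the question's dataframe
patient_dets = pd.DataFrame({'PatientAge': [60, 45, 21, 34, 10],
                             'PatientAgecounts': [1204, 700, 400, 56, 150]})

# ten-year bin edges covering the min..max age range
lo = 10 * (patient_dets['PatientAge'].min() // 10)
hi = 10 * (patient_dets['PatientAge'].max() // 10 + 1)
bins = np.arange(lo, hi + 1, 10)

# each age contributes its count to the bin it falls in
counts, edges, _ = plt.hist(patient_dets['PatientAge'],
                            bins=bins,
                            weights=patient_dets['PatientAgecounts'])
plt.xlabel('Age')
plt.ylabel('Count')
```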