Python: Trouble running any additional queries on Geopy output?

I've encountered the following problem and am struggling to wrap my head around it.
I have some data that looks like this:
I've written the following Python that works out the distance between the two sets of coordinates:
import pandas as pd
from geopy import distance
# Calculate distance between 2 sets of coordinates
# Result is float64
data['Distance'] = data[['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng']].apply(lambda x: distance.distance((x[0],x[1]), (x[2],x[3])).km, axis=1)
print(data['Distance'])
# Create quantiles
data["DisBucket"] = pd.qcut(df_nyc.Aftermath, q=[0, 0.3, 0.7, 1.0], labels=['LOW', 'MEDIUM', 'HIGH'])
The first bit works fine and returns the following as a float64:
The second bit, however, fails and returns the following error:
It doesn't seem to like the output from Geopy for whatever reason, and I haven't been able to work out a way around this. Is there potentially a way to copy across the values without the association to Geopy?
Any advice would be greatly appreciated :)

The following runs end to end; note that pd.qcut is applied to the computed Distance column (geopy's .km already returns a plain float, so qcut can bin it directly):
import pandas as pd
from geopy import distance

# Create sample data
dat = [[43.11944, -75.2932, 40.12029, -74.2935],
       [40.83488, -75.8662, 40.83377, -73.8633],
       [40.81212, -73.9165, 40.80491, -73.9112],
       [43.07367, -78.9906, 43.07523, -78.9906],
       [41.30884, -74.0253, 40.30746, -74.028]]
data = pd.DataFrame(dat, columns=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'])
# Calculate distance between the two sets of coordinates
data['Distance'] = data.apply(lambda x: distance.distance((x['Start_Lat'], x['Start_Lng']),
                                                          (x['End_Lat'], x['End_Lng'])).km, axis=1)
# Create quantiles
data['DisBucket'] = pd.qcut(data.Distance, q=[0, 0.3, 0.7, 1.0], labels=['LOW', 'MEDIUM', 'HIGH'])
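If you do want to be explicit that the column holds plain floats with no geopy objects attached (a minimal sketch; since .km already returns a float, this cast is normally a no-op):
# Force a plain float64 dtype on the computed column
data['Distance'] = data['Distance'].astype(float)
print(data.dtypes)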

Related

Is there a python package that allows you to model a time series of compositional data?

I have 2 time series that look like this:
import pandas as pd
series_1 = pd.DataFrame({'time': [0,1,2,3,4], 'value_1': [0.3, 0.5, 0.4, 0.8, 0.7]})
series_2 = pd.DataFrame({'time': [0,1,2,3,4], 'value_2': [0.7, 0.5, 0.6, 0.2, 0.3]})
As you can see, at each point in time the values of value_1 and value_2 sum to 1.
From what I read, this type of time series is called "compositional".
My question is: is there a python package that can help me model this type of time series?
I have tried using prophet to model each series separately and then scaling the forecast values so that they sum to 1, but I am not sure whether this approach is appropriate for this type of time series data. Any thoughts on that?
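For what it's worth, a minimal sketch of that rescaling approach (assuming the prophet package, and treating the integer times as daily timestamps purely for illustration):
import pandas as pd
from prophet import Prophet

def fit_and_forecast(series, value_col, periods=3):
    # Prophet expects a dataframe with columns 'ds' (datetime) and 'y'
    df = pd.DataFrame({'ds': pd.date_range('2020-01-01', periods=len(series), freq='D'),
                       'y': series[value_col]})
    m = Prophet()
    m.fit(df)
    future = m.make_future_dataframe(periods=periods)
    return m.predict(future)['yhat']

f1 = fit_and_forecast(series_1, 'value_1')
f2 = fit_and_forecast(series_2, 'value_2')
total = f1 + f2
f1, f2 = f1 / total, f2 / total  # rescale so the components sum to 1 at each step
The usual alternative from compositional data analysis is to model a log-ratio transform of the components instead, so the sum-to-one constraint holds by construction rather than being patched in afterwards.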

Pyplot - show x-axis labels according to y-axis value

I have a 1 min 20 s long video recorded at 23.813 FPS. More precisely, I have 1923 frames in which I've been scanning for desired features. I've detected some specific behavior via a neural network, and using a chosen metric I calculated a value for each frame.
So, now, I have X-Y values to plot a graph:
X: time (each step of size 0.041993869 s)
Y: the value measured by the neural network
In the default state, the plot looks like this:
So, I've tried to limit the number of bins in the hope that the bins would be spread over all my values. But they are not. As you can see, only the first fifteen x-values are rendered:
pyplot.locator_params(axis='x', nbins=15)
But neither is the desired state. The desired state should render the labels only of those x-bins whose y-value is higher than e.g. 1.2. So, it should look like this:
Is it possible to achieve such a result?
Code:
# draw plot
from pandas import read_csv
from matplotlib import pyplot
test_video_fps = 23.813
df = read_csv('/path/to/csv/file/file.csv', header=None)
df.columns = ['anomaly']
df['time'] = [round((i + 1) / test_video_fps, 2) for i in range(df.shape[0])]
axes = df.plot.bar(x='time', y='anomaly', rot=0)
# pyplot.locator_params(axis='x', nbins=15)
# axes.get_xaxis().set_visible(False)
fig = pyplot.gcf()
fig.set_size_inches(16, 10)
fig.savefig('/path/to/output/plot.png', dpi=100)
# pyplot.show()
Example:
A simple example with a subset of the original data:
0.379799
0.383786
0.345488
0.433286
0.469474
0.431993
0.474253
0.418843
0.491070
0.447778
0.384890
0.410994
0.898229
1.872756
2.907009
3.691382
4.685749
4.599612
3.738768
8.043357
7.660785
2.311198
1.956096
2.877326
3.467511
3.896339
4.250552
6.485533
7.452986
7.103761
2.684189
2.516134
1.512196
1.435303
0.852047
0.842551
0.957888
0.983085
0.990608
1.046679
1.082040
1.119655
0.962391
1.263255
1.371034
1.652812
2.160451
2.646674
1.460051
1.163745
0.938030
0.862976
0.734119
0.567076
0.417270
Desired plot:
Your question has become a two-part problem, but it is interesting enough that I will answer both parts.
I will answer this using Matplotlib's object-oriented interface with numpy data rather than pandas. This will make things easier to explain, and it can easily be generalized to pandas.
I will assume that you have the following two data arrays:
import numpy as np

dt = 0.041993869
x = np.arange(0.0, 15 * dt, dt)
y = np.array([1., 1.1, 1.3, 7.6, 2.4, 0.8, 0.7, 0.8, 1.0, 1.5, 10.0, 4.5, 3.2, 0.9, 0.7])
Part 1: Identifying the locations where you want labels
The data can be masked to get the locations of the peaks:
mask = y > 1.2
Consecutive peaks can be easily eliminated by computing the diff. A diff of a boolean mask will be True at the locations where the mask changes sense. You will then have to take every other element to get the locations where it goes from False to True. The following code will capture all the corner cases where you start with a peak or end in the middle of a peak:
d = np.flatnonzero(np.diff(mask))
if mask[d[0]]:  # first diff is the end of a peak: True to False
    d = np.concatenate(([0], d[1::2] + 1))
else:
    d = d[::2] + 1
d is now an array of indices into x and y that represents the first element of each run of peaks. You can get the last element by swapping the indices [1::2] and [::2] in the if-else statement and removing the + 1 in both cases.
The locations of the labels are now simply x[d].
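Spelled out, that swapped version looks something like this (a sketch on my part, ignoring the corner case where the series ends inside a peak):
e = np.flatnonzero(np.diff(mask))
if mask[e[0]]:   # series starts inside a peak: the even changes are ends
    e = e[::2]
else:            # the first change is a start, so the odd changes are ends
    e = e[1::2]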
Part 2: Locating and formatting the labels
For this part, you will need to access Matplotlib's object-oriented API via the Axes object you are plotting on. You already have this from pandas, which makes the transfer easy. Here is a sample in raw Matplotlib:
import matplotlib.pyplot as plt
from matplotlib import ticker

fig, axes = plt.subplots()
axes.plot(x, y)
Now use the ticker API to easily set the locations and labels. You actually set the locations directly (not with a Locator) since you have a very fixed list of ticks:
axes.set_xticks(x[d])
axes.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:0.01g}s'))
For the sample data shown here, you get
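One caveat when transferring this back to the question's bar plot (an assumption worth checking against your pandas version): pandas bar plots place the bars at categorical positions 0..N-1, so the tick positions should be the indices d while the labels come from x[d]:
# For the axes returned by df.plot.bar(...): positions are indices,
# labels are the corresponding times
axes.set_xticks(d)
axes.set_xticklabels([f'{xi:0.2f}s' for xi in x[d]])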

Python: Is there a way to apply a function separately on all columns of a dataframe specified in a list? Pyfinance package

I would appreciate it if someone could give me a good answer, as I am not making any progress; none of the resources I found helped, and my attempts to create a loop failed.
I have a big dataframe of stock returns; I've tried to create a test dataset of it below. The function I use to derive rolling betas of the stock returns (vs. market returns) works well. However, I haven't managed to write a function that applies it to all/selected columns separately.
I would be interested in two solutions:
i) one which helps me apply/loop the ols.PandasRollingOLS function over each column of the dataframe separately,
ii) and another which runs the ols.PandasRollingOLS function separately over only the columns specified in a list.
import pandas as pd

# initialise data
data = {'Market': [0.03, -0.01, 0.01, -0.01],
        'Stock1': [0.01, -0.02, 0.03, -0.02],
        'Stock2': [0.01, -0.011, 0.01, -0.011],
        'Stock3': [0.01, -0.011, 0.01, -0.011],
        'Stock4': [0.02, -0.1, 0.02, 0.09],
        'Stock5': [-0.01, 0.01, 0.01, 0.005]}
list_stocks = ["Stock1", "Stock2", "Stock3"]

# Create DataFrame
df = pd.DataFrame(data)
The following code yields the right results, but given the size of the real dataframe I cannot spell it out column by column like this:
#!pip3 install pyfinance
from pyfinance import ols

y = df["Market"]
w = 2
roll_beta = pd.DataFrame()
roll_beta["Stock1"] = ols.PandasRollingOLS(y=y, x=df[["Stock1"]], window=w).beta["Stock1"]
roll_beta["Stock2"] = ols.PandasRollingOLS(y=y, x=df[["Stock2"]], window=w).beta["Stock2"]
roll_beta["Stock3"] = ols.PandasRollingOLS(y=y, x=df[["Stock3"]], window=w).beta["Stock3"]
print(roll_beta)
     Stock1    Stock2    Stock3
1  1.333333  1.904762  1.904762
2  0.400000  0.952381  0.952381
3  0.400000  0.952381  0.952381
You said you'd tried a loop but not what the issue was. Here's a simple loop based on your example - does this work for you?
from pyfinance import ols
import pandas as pd

y = df["Market"]
w = 2
roll_beta = pd.DataFrame()
for col in df.columns[1:]:
    roll_beta[col] = ols.PandasRollingOLS(y=y, x=df[[col]], window=w).beta[col]
print(roll_beta)
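For part (ii), the same pattern works over an explicit list of columns (using list_stocks from the question; the names are assumed to match the dataframe's columns):
for col in list_stocks:
    roll_beta[col] = ols.PandasRollingOLS(y=y, x=df[[col]], window=w).beta[col]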

How do I calculate standard deviation of two arrays in python?

I have two arrays: one with 30 years of observations, and one with 30 years of historical model runs. I want to calculate the standard deviation between observations and model results, to see how much the model deviates from observations. How do I go about doing this?
Edit
Here are the two arrays (each number represents a year, 1971-2000):
obs = [2790.90283203, 2871.02514648, 2641.31738281, 2721.64453125, 2554.19384766,
       2773.7746582, 2500.95825195, 3238.41186523, 2571.62133789, 2421.93017578,
       2615.80395508, 2271.70654297, 2703.82275391, 3062.25366211, 2656.18359375,
       2593.62231445, 2547.87182617, 2846.01245117, 2530.37573242, 2535.79931641,
       2237.58032227, 2890.19067383, 2406.27587891, 2294.24975586, 2510.43847656,
       2395.32055664, 2378.36157227, 2361.31689453, 2410.75, 2593.62915039]
model = [2976.01928711, 3353.92114258, 3000.92700195, 3116.5078125, 2935.31787109,
         2799.75805664, 3328.06225586, 3344.66333008, 3318.31689453, 3348.85302734,
         3578.70800781, 2791.78198242, 4187.99902344, 3610.77124023, 2991.984375,
         3112.97412109, 4223.96826172, 3590.92724609, 3284.6015625, 3846.34936523,
         3955.84350586, 3034.26074219, 3574.46362305, 3674.80175781, 3047.98144531,
         3209.56616211, 2654.86547852, 2780.55053711, 3117.91699219, 2737.67626953]
You want to compare two signals, e.g. A and B in the following example:
import numpy as np

A = np.random.rand(5)
B = np.random.rand(5)
print("A:", A)
print("B:", B)
Output:
A: [ 0.66926369 0.63547359 0.5294013 0.65333154 0.63912645]
B: [ 0.17207719 0.26638423 0.55176735 0.05251388 0.90012135]
Analyzing individual signals
The standard deviation of each single signal is not what you need:
print "standard deviation of A:", np.std(A)
print "standard deviation of B:", np.std(B)
Output:
standard deviation of A: 0.0494162021651
standard deviation of B: 0.304319034639
Analyzing the difference
Instead you might compute the difference and apply some common measure like the sum of absolute differences (SAD), the sum of squared differences (SSD) or the correlation coefficient:
print "difference:", A - B
print "SAD:", np.sum(np.abs(A - B))
print "SSD:", np.sum(np.square(A - B))
print "correlation:", np.corrcoef(np.array((A, B)))[0, 1]
Output:
difference: [ 0.4971865 0.36908937 -0.02236605 0.60081766 -0.2609949 ]
SAD: 1.75045448355
SSD: 0.813021824351
correlation: -0.38247081
Use numpy.
import numpy as np
data = [1.2, 2.3, 1.3, 1.2, 5.4]
np.std(data)
Or you could try this:
import numpy as np
obs = np.array([1.2, 2.3, 1.3, 1.2, 5.4])
model = np.array([1.1, 2.4, 1.2, 1.2, 5.3])
np.std(obs-model)
The standard deviation at each index across multiple lists (e.g. comparing model vs. measurement, multiple sets of measurement data, etc.), such as
import numpy as np
obs = np.array([0,1,2,3,4])
model = np.array([2,4,6,8,10])
can be calculated by stacking the data into one array:
arr = np.vstack((obs,model))
Now the standard deviation can be calculated using np.std() with a specific axis:
std = np.std(arr,axis=0)
Alternative one line solution:
std = np.std((model,obs),axis=0)
Output:
[1.0, 1.5, 2.0, 2.5, 3.0]
If you're doing anything more complicated than just finding the standard deviation and/or mean, use numpy/scipy. If that's all you need to do, use the statistics package from the Python Standard Library.
>>> import statistics
>>> statistics.stdev([1, 2, 3])
1.0
It was added in Python 3.4 (see PEP-450) as a lightweight alternative to Numpy for basic stats equations.

Aligning two data sets in Python

I want to develop some python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
                      'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
                      'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME': [0.9, 2.1, 2.9, 4.2],
                      'VALUE': [18.4, 18.7, 18.9, 18.8],
                      'ERROR': [0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The result is plotted here:
What I would like to do now is to align the second dataset (data2) to the first one (data1). i.e. to get this:
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform the shift, since that may produce bad results depending on how the data is sampled. I was considering taking each data2[TIME_i], working out the shortest distance to data1[~TIME_i], and then minimizing the sum of those distances (see the sketch below). But I am not sure that would work well either.
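For illustration only, here is one reading of that idea (my interpretation, not a tested method): pair each data2 point with the data1 point nearest in time, then pick the constant shift that minimizes the summed absolute value differences, which is simply the median of the pairwise differences:
import numpy as np

# Index of the data1 point nearest in time to each data2 point
nearest = np.abs(data2.TIME.values[:, None] - data1.TIME.values[None, :]).argmin(axis=1)
diffs = data2.VALUE.values - data1.VALUE.values[nearest]
shift = np.median(diffs)  # the median minimizes the sum of absolute deviations
aligned_value = data2.VALUE - shift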
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can subtract the mean of the difference: data2.VALUE - (data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt

# Define some data
data1 = pd.DataFrame({
    'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
    'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
    'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
    'TIME': [0.9, 2.1, 2.9, 4.2],
    'VALUE': [18.4, 18.7, 18.9, 18.8],
    'ERROR': [0.3, 0.2, 0.5, 0.4],
})

# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE - (data2.VALUE - data1.VALUE).mean(),
             yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series, as sketched below.
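A minimal sketch of that (my illustration, reusing the data1/data2 frames from above):
# Center each series on its own mean so both fluctuate around zero
plt.errorbar(data1.TIME, data1.VALUE - data1.VALUE.mean(), yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE - data2.VALUE.mean(), yerr=data2.ERROR, fmt='bo')
plt.show()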
You can calculate the offset between the averages and subtract that from every value. If you do this for every value, the series should align relatively well. This assumes both datasets look relatively similar, so it might not work best in every case.
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal
