I have the following time:
1 days 04:05:33.623000
Is it possible to convert this time to milliseconds?
Like this:
101133623.0
It's certainly possible.
1 day + 4 hours + 5 minutes + 33 seconds + 623 milliseconds =
24 * 60 * 60 seconds + 4 * 60 * 60 seconds + 5 * 60 seconds + 33 seconds + 0.623 seconds =
86400 seconds + 14400 seconds + 300 seconds + 33 seconds + 0.623 seconds =
101133.623 seconds
Just use multiplication. Function below:
def timestamp_to_milliseconds(timestamp):
    # expects a string like "1 days 04:05:33.623000"
    days, time_part = timestamp.split(" days ")
    hours, minutes, seconds = time_part.split(":")
    total_seconds = int(days) * 24 * 60 * 60 + int(hours) * 60 * 60 + int(minutes) * 60 + float(seconds)
    return total_seconds * 1000
I found a way to solve the problem myself:
result = timestamp.total_seconds()*1000
print(result)
101133623.0
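Putting it together, a minimal sketch assuming the value is parsed as a pandas Timedelta (which matches the string shown in the question):

```python
import pandas as pd

# Parse the string into a Timedelta, then convert to milliseconds
timestamp = pd.Timedelta("1 days 04:05:33.623000")
result = timestamp.total_seconds() * 1000
print(result)  # 101133623.0
```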
I'm trying to calculate daily returns using the time weighted rate of return formula:
(Ending Value-(Beginning Value + Net Additions)) / (Beginning value + Net Additions)
My DF looks like:
Account # Date Balance Net Additions
1 9/1/2022 100 0
1 9/2/2022 115 10
1 9/3/2022 117 0
2 9/1/2022 50 0
2 9/2/2022 52 0
2 9/3/2022 40 -15
It should look like:
Account # Date Balance Net Additions Daily TWRR
1 9/1/2022 100 0
1 9/2/2022 115 10 0.04545
1 9/3/2022 117 0 0.01739
2 9/1/2022 50 0
2 9/2/2022 52 0 0.04
2 9/3/2022 40 -15 0.08108
After calculating the daily returns for each account, I want to link all the returns throughout the month to get the monthly return:
((1 + return) * (1 + return)) - 1
The final result should look like:
Account # Monthly Return
1 0.063636
2 0.12432
Through research (and trial and error), I was able to get the output I am looking for but as a new python user, I'm sure there is an easier/better way to accomplish this.
DF["Numerator"] = DF.groupby("Account #")["Balance"].diff() - DF["Net Additions"]
DF["Denominator"] = ((DF["Numerator"] + DF["Net Additions"] - DF["Balance"]) * -1) + DF["Net Additions"]
DF["Daily Returns"] = (DF["Numerator"] / DF["Denominator"]) + 1
DF = DF.groupby("Account #")["Daily Returns"].prod() - 1
Any help is appreciated!
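For reference, the same result can be reached more directly with shift and a grouped product; a sketch, assuming the column names from the question:

```python
import pandas as pd

DF = pd.DataFrame({
    "Account #": [1, 1, 1, 2, 2, 2],
    "Balance": [100, 115, 117, 50, 52, 40],
    "Net Additions": [0, 10, 0, 0, 0, -15],
})

# Beginning value = previous day's balance within each account, plus net additions
begin = DF.groupby("Account #")["Balance"].shift() + DF["Net Additions"]
DF["Daily TWRR"] = (DF["Balance"] - begin) / begin

# Geometrically link the daily returns into a monthly return per account
monthly = DF.groupby("Account #")["Daily TWRR"].apply(lambda r: (1 + r).prod() - 1)
```

shift() supplies each row's beginning value, and prod() (with its default skipna) ignores the NaN on each account's first day.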
The installment amount is calculated by the formula below.
I have a dataframe where I have the principal amount (P), installment amount and number of payments (n) in different columns and I wish to calculate the interest rate (i) for all rows.
Principal (P)   Installment Amount   Number of Installments (n)   Interest Rate (i)
5.300           187                  35                           r
Given a dataframe called df
>>> df
Principal Installment Num Payments
0 1000.0 40.0 30
1 3500.0 200.0 20
2 10000000.0 2000000.0 10
and a function interest using some root-finding method (Newton-Raphson in the example below):
ERROR_TOLERANCE = 1e-6

def interest(principal, installment, num_payments):
    def f(x):
        return principal * x**(num_payments + 1) - (principal + installment) * x**num_payments + installment
    def f_prime(x):
        return principal * (num_payments + 1) * x**num_payments - (principal + installment) * num_payments * x**(num_payments - 1)
    guess = 1 + (((installment * num_payments / principal) - 1) / 12)
    intermediate = f(guess)
    while abs(intermediate) > ERROR_TOLERANCE:
        guess = guess - intermediate / f_prime(guess)
        intermediate = f(guess)
    return guess
you can calculate the interest rate like
df['Interest'] = df.apply(lambda row: interest(row['Principal'],row['Installment'],row['Num Payments']),axis=1)
giving
>>> df
Principal Installment Num Payments Interest
0 1000.0 40.0 30 1.012191
1 3500.0 200.0 20 1.013069
2 10000000.0 2000000.0 10 1.150984
Note: tweak ERROR_TOLERANCE as desired to meet requirements.
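As a sanity check, the polynomial f above is the annuity equation P·xⁿ·(x−1) = installment·(xⁿ−1) rearranged (with x = 1 + i), so a solved rate can be verified by plugging it back in; here with the first row of the output:

```python
def implied_installment(principal, x, num_payments):
    # annuity payment implied by the per-period growth factor x = 1 + i
    return principal * (x - 1) * x**num_payments / (x**num_payments - 1)

# the rate solved for the first row (x ≈ 1.012191) should give back the installment of 40
check = implied_installment(1000.0, 1.012191, 30)
print(round(check, 2))
```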
I would like to compute the distance between successive coordinates. I know I can compute the haversine distance between two points. However, I was wondering if there is an easier way than writing a loop that applies the formula over the entire columns (my loop is also throwing errors).
Here's some data for the example
# Random values for the duration from one point to another
random_values = random.sample(range(2,20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame(
{'duration': random_values,
'latitude': lat_coor,
'longitude': lon_coor
})
df
duration latitude longitude
0 5 11.923855 57.723843
1 2 11.923862 57.723831
2 10 11.923851 57.723839
3 19 11.923847 57.723831
4 16 11.923865 57.723827
5 4 11.923841 57.723831
6 13 11.923860 57.723835
7 3 11.923846 57.723827
To compute the distance this is what I've attempted:
# Looping over each row to compute the Haversine distance between two points
import math

# Earth's radius (in m)
R = 6373.0 * 1000

lat = df["latitude"]
lon = df["longitude"]

for i in lat:
    lat1 = lat[i]
    lat2 = lat[i+1]
    for j in lon:
        lon1 = lon[i]
        lon2 = lon[i+1]
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    print(distance)  # in m
However, this raises a KeyError instead of printing distances. The two points for each distance should be successive observations taken from the same columns:
first distance value:
11.923855 57.723843 (point1/observation1)
11.923862 57.723831 (point2/observation2)
second distance value:
11.923862 57.723831 (point1/observation2)
11.923851 57.723839 (point2/observation3)
third distance value:
11.923851 57.723839 (point1/observation3)
11.923847 57.723831 (point2/observation4)
... (and so on)
OK, first you can create a dataframe that combine each measurement with the previous one:
df2 = pd.concat([df.add_suffix('_pre').shift(), df], axis=1)
df2
This outputs:
duration_pre latitude_pre longitude_pre duration latitude longitude
0 NaN NaN NaN 5 11.923855 57.723843
1 5.0 11.923855 57.723843 2 11.923862 57.723831
2 2.0 11.923862 57.723831 10 11.923851 57.723839
…
Then create a haversine function and apply it to the rows:
import math

def haversine(lat1, lon1, lat2, lon2):
    # Earth's radius (in m); note that for true geographic distances the
    # coordinates should first be converted to radians with math.radians
    R = 6373.0 * 1000
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
df2.apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
which computes for each row the distance with the previous row (first one is thus NaN).
0 NaN
1 75.754755
2 81.120210
3 48.123604
…
And, if you want to include a new column in the original dataframe in one line:
df['distance'] = pd.concat([df.add_suffix('_pre').shift(), df], axis=1).apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
Output:
duration latitude longitude distance
0 5 11.923855 57.723843 NaN
1 2 11.923862 57.723831 75.754755
2 10 11.923851 57.723839 81.120210
3 19 11.923847 57.723831 48.123604
4 16 11.923865 57.723827 116.515304
5 4 11.923841 57.723831 154.307571
6 13 11.923860 57.723835 122.794838
7 3 11.923846 57.723827 98.115312
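One caveat about the snippet above: it feeds raw degree values into the trig functions, whereas the haversine formula expects radians. A vectorized sketch with numpy that includes the conversion (assuming the same column names; shown here with the first three coordinates from the question):

```python
import numpy as np
import pandas as pd

def haversine_prev(df, R=6373000.0):
    # Distance (m) between each row and the previous row; coordinates in degrees
    lat = np.radians(df["latitude"].to_numpy())
    lon = np.radians(df["longitude"].to_numpy())
    dlat = np.diff(lat)
    dlon = np.diff(lon)
    a = np.sin(dlat / 2)**2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2)**2
    d = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return np.concatenate([[np.nan], d])  # first row has no predecessor

df = pd.DataFrame({
    "latitude": [11.923855, 11.923862, 11.923851],
    "longitude": [57.723843, 57.723831, 57.723839],
})
df["distance"] = haversine_prev(df)
```

np.diff pairs each row with its predecessor, so no apply or Python-level loop is needed.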
I understood that you want to get the pairwise haversine distance between all points in your df. Here's how this could be done:
Be careful when using this approach with a lot of points, as it quickly generates a lot of columns.
Setup
import random
random_values = random.sample(range(2,20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame(
{'duration': random_values,
'latitude': lat_coor,
'longitude': lon_coor
})
Get radians
import math
df['lat_rad'] = df.latitude.apply(math.radians)
df['long_rad'] = df.longitude.apply(math.radians)
Calculate pairwise distances
from sklearn.metrics.pairwise import haversine_distances

for idx_from, from_point in df.iterrows():
    for idx_to, to_point in df.iterrows():
        column_name = f"Distance_to_point_{idx_from}"
        haversine_matrix = haversine_distances([[from_point.lat_rad, from_point.long_rad],
                                                [to_point.lat_rad, to_point.long_rad]])
        point_distance = haversine_matrix[0][1] * 6371000 / 1000
        df.loc[idx_to, column_name] = point_distance
df
   duration   latitude  longitude   lat_rad  long_rad  Distance_to_point_0  ...  Distance_to_point_7
0         3  11.923855  57.723843  0.208111  0.208111             0.000000  ...             0.001400
1        13  11.923862  57.723831  0.208111  0.208111             0.001089  ...             0.002489
2        14  11.923851  57.723839  0.208110  0.208110             0.000622  ...             0.000778
3         4  11.923847  57.723831  0.208110  0.208110             0.001245  ...             0.000156
4         5  11.923865  57.723827  0.208111  0.208111             0.001556  ...             0.002956
5         7  11.923841  57.723831  0.208110  0.208110             0.002178  ...             0.000778
6         9  11.923860  57.723835  0.208111  0.208111             0.000778  ...             0.002178
7         8  11.923846  57.723827  0.208110  0.208110             0.001400  ...             0.000000
You are confusing the index with the values themselves: `for i in lat` iterates over the values, so there is no lat[i] (e.g., lat[11.923855]) and you get a KeyError. Even after fixing i to be the index, your [i+1] would run past the last row of lat and lon. Since you want to compare each row to the previous row, start at index 1 and look back by 1; then you won't go out of range. This edited version of your code does not crash:
for i in range(1, len(lat)):
    lat1 = lat[i - 1]
    lat2 = lat[i]
    for j in range(1, len(lon)):  # note: redundant, assigns the same lon1/lon2 each pass
        lon1 = lon[i - 1]
        lon2 = lon[i]
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    print(distance)  # in m
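Since the inner loop only reassigns the same values, it can be dropped entirely; a single-loop sketch that also adds the degree-to-radian conversion the haversine formula expects (first four coordinates from the question):

```python
import math

R = 6373.0 * 1000  # Earth's radius (in m)

lat = [11.923855, 11.923862, 11.923851, 11.923847]
lon = [57.723843, 57.723831, 57.723839, 57.723831]

distances = []
for i in range(1, len(lat)):
    # convert degrees to radians before applying the haversine formula
    lat1, lat2 = math.radians(lat[i - 1]), math.radians(lat[i])
    lon1, lon2 = math.radians(lon[i - 1]), math.radians(lon[i])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    distances.append(R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)))
print(distances)  # distances in metres between successive points
```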
I am trying to plot a number of bar charts with matplotlib, each having exactly 26 timestamps/slots on the x-axis and two integers on the y-axis. For most data sets this scales fine, but in some cases matplotlib lets the bars overlap:
[Screenshots: the left chart shows overlapping bars not aligned to the xticks; the right one is OK]
So instead of being given enough space, the bars overlap, although the width is set to 0.1 and I checked that my datasets have 26 values.
My code to plot these charts is as follows:
# Plot something
rows = len(data_dict) // 2 + 1
fig = plt.figure(figsize=(15, 5*rows))
gs1 = gridspec.GridSpec(rows, 2)
grid_x = 0
grid_y = 0
for dataset_name in data_dict:
    message1_list = []
    message2_list = []
    ts_list = []
    slot_list = []
    for slot, counts in data_dict[dataset_name].items():
        slot_list.append(slot)
        message1_list.append(counts["Message1"])
        message2_list.append(counts["Message2"])
        ts_list.append(counts["TS"])
    ax = fig.add_subplot(gs1[grid_y, grid_x])
    ax.set_title("Activity: " + dataset_name, fontsize=24)
    ax.set_xlabel("Timestamps", fontsize=14)
    ax.set_ylabel("Number of messages", fontsize=14)
    ax.xaxis_date()
    hfmt = matplotdates.DateFormatter('%d.%m,%H:%M')
    ax.xaxis.set_major_formatter(hfmt)
    ax.set_xticks(ts_list)
    plt.setp(ax.get_xticklabels(), rotation=60, ha='right')
    ax.tick_params(axis='x', pad=0.75, length=5.0)
    rects = ax.bar(ts_list, message2_list, align='center', width=0.1)
    rects2 = ax.bar(ts_list, message1_list, align='center', width=0.1, bottom=message2_list)
    # update grid position
    if grid_x == 1:
        grid_x = 0
        grid_y += 1
    else:
        grid_x = 1
plt.tight_layout(pad=0.01)
plt.savefig(r"output_files\activity_barcharts.svg", bbox_inches='tight')
plt.gcf().clear()
The input data looks as follows (example of a plot with overlapping bars, second picture)
slot - message1 - message2 - timestamp
0 - 0 - 42 - 2017-09-11 07:59:53.517000+00:00
1 - 0 - 4 - 2017-09-11 09:02:28.827875+00:00
2 - 0 - 0 - 2017-09-11 10:05:04.138750+00:00
3 - 0 - 0 - 2017-09-11 11:07:39.449625+00:00
4 - 0 - 0 - 2017-09-11 12:10:14.760500+00:00
5 - 0 - 0 - 2017-09-11 13:12:50.071375+00:00
6 - 0 - 13 - 2017-09-11 14:15:25.382250+00:00
7 - 0 - 0 - 2017-09-11 15:18:00.693125+00:00
8 - 0 - 0 - 2017-09-11 16:20:36.004000+00:00
9 - 0 - 0 - 2017-09-11 17:23:11.314875+00:00
10 - 0 - 0 - 2017-09-11 18:25:46.625750+00:00
11 - 0 - 0 - 2017-09-11 19:28:21.936625+00:00
12 - 0 - 0 - 2017-09-11 20:30:57.247500+00:00
13 - 0 - 0 - 2017-09-11 21:33:32.558375+00:00
14 - 0 - 0 - 2017-09-11 22:36:07.869250+00:00
15 - 0 - 0 - 2017-09-11 23:38:43.180125+00:00
16 - 0 - 0 - 2017-09-12 00:41:18.491000+00:00
17 - 0 - 0 - 2017-09-12 01:43:53.801875+00:00
18 - 0 - 0 - 2017-09-12 02:46:29.112750+00:00
19 - 0 - 0 - 2017-09-12 03:49:04.423625+00:00
20 - 0 - 0 - 2017-09-12 04:51:39.734500+00:00
21 - 0 - 0 - 2017-09-12 05:54:15.045375+00:00
22 - 0 - 0 - 2017-09-12 06:56:50.356250+00:00
23 - 0 - 0 - 2017-09-12 07:59:25.667125+00:00
24 - 0 - 20 - 2017-09-12 09:02:00.978000+00:00
25 - 0 - 0 - 2017-09-12 10:04:36.288875+00:00
Does anyone know how to prevent this from happening?
I calculated exactly 26 bars for every chart and expected them to have equal width. I also tried replacing the 0 values with 1e-5, as another post proposed, but that did not prevent the overlapping.
The width of the bars is given in data units. That is, if you want a bar to be 1 minute wide, you would set the width to
plt.bar(..., width=1./(24*60.))
because the numeric axis unit for datetime axes in matplotlib is days and there are 24*60 minutes in a day.
For an automatic determination of the bar width, you can take the smallest difference between any two successive values in the input time list. In that case, something like the following will do the trick:
import numpy as np
import matplotlib.pyplot as plt
import datetime
import matplotlib.dates
t = [datetime.datetime(2017,9,12,8,i) for i in range(60)]
x = np.random.rand(60)
td = np.diff(t).min()
s1 = matplotlib.dates.date2num(datetime.datetime.now())
s2 = matplotlib.dates.date2num(datetime.datetime.now()+td)
plt.bar(t, x, width=s2-s1, ec="k")
plt.show()
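The same idea can be packaged as a small helper for the 26-slot case from the question (hypothetical function name; the width is in days, matplotlib's datetime unit):

```python
import datetime
import numpy as np
import matplotlib.dates as mdates

def bar_width_days(ts_list):
    # smallest gap between successive timestamps, converted to days
    nums = np.sort(mdates.date2num(ts_list))
    return np.diff(nums).min()

# example: hourly timestamps -> width of 1/24 day per bar
ts = [datetime.datetime(2017, 9, 12, h) for h in range(8)]
width = bar_width_days(ts)
# usage: ax.bar(ts, values, width=width * 0.9)  # the 0.9 factor leaves a small gap
```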