I'm trying to write a Python script in which a line must be executed a defined number of times per second (corresponding to an FPS value). I'm having a precision problem that I suspect comes from time.time_ns(), and it gets worse as I go to higher ticks-per-second values.
If I set, for example, 240 ticks per second, my ticks-per-second counter displays 200, 250, 250, 250, 200. It's impossible to get values between 200 and 250. The same thing happens at lower rates: I get values between 112 and 125 when I would have liked 120.
What is wrong with my logic, and how can I improve it?
import time

fps = 120
timeFrame_ns = 1000000000 / fps
timeTickCounter = 0
count = 0
elapsedTime = time.time_ns()
while True:
    if time.time_ns() > timeTickCounter + 1000000000:
        print("Ticks per second %s " % count)
        if count != fps and count != 0:
            # Adjust the frame time by the relative error of the last second
            diff = fps - count
            ratio = diff / count
            timeFrame_ns -= timeFrame_ns * ratio
        timeTickCounter = time.time_ns()
        count = 0
    count += 1
    while True:
        # Busy-wait until one frame has elapsed
        if timeFrame_ns <= (time.time_ns() - elapsedTime):
            break
    elapsedTime = time.time_ns()
This script prints:
Ticks per second 0
Ticks per second 112
Ticks per second 125
Ticks per second 112
Ticks per second 125
Ticks per second 125
Ticks per second 112
Ticks per second 125
Ticks per second 125
Ticks per second 112
Ticks per second 125
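For reference, one pattern that avoids both the busy-wait drift and the coarse adjustment is to schedule each tick against an absolute deadline rather than re-measuring a frame from the current time, so the error does not accumulate. A minimal sketch of that idea (not a fix of the exact code above):

import time

fps = 120
frame_ns = 1_000_000_000 // fps
next_tick = time.time_ns()       # absolute deadline for the next tick
second_start = next_tick
count = 0

while True:
    now = time.time_ns()
    if now >= next_tick:
        next_tick += frame_ns    # advance the deadline; timing error does not accumulate
        count += 1               # this is the "work" done once per frame
    if now - second_start >= 1_000_000_000:
        print("Ticks per second %s" % count)
        count = 0
        second_start += 1_000_000_000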
The dataframe (containing data on the 2016 elections), loaded into pandas from a .csv, has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea is to find the first X counties with the highest percentage of votes in favor of candidate X (removing the Totals).
For example, suppose we want 100 counties and the candidate is Trump; the operation to carry out per county is: 100 * sum of votes for Trump / total votes.
I have implemented the following code, which gets correct results:
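For instance, using only the two Ada rows shown above, Ada's percentage for Trump would be 100 * 10000 / (10000 + 900) ≈ 91.7%.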
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time, I realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try amending your code to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages, you can slightly change the code as below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
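If the plain division in the df2 line does not broadcast across the candidate level in your pandas version, the same alignment can be written out explicitly with div and its level argument:

per_candidate = df1.groupby(['county', 'candidate'])['votes'].sum()
per_county = df1.groupby('county')['votes'].sum()
df2 = 100 * per_candidate.div(per_county, level='county')  # broadcast the per-county totals across candidates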
You can try:
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df.loc[df['candidate'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don't have the data, but it doesn't use any apply, which should improve performance.
For the renaming of the column you can use .rename(columns={'votes': 'percentage'}) at the end.
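For example, the first expression with the rename applied would read:

(df[df['candidate'] == 'Trump'].groupby(['county']).sum() / df['votes'].sum() * 100) \
    .nlargest(100, 'votes') \
    .rename(columns={'votes': 'percentage'})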
My algorithm's runtime went from 35 seconds to 15 minutes when I implemented this feature over a daily timeframe. The algo retrieves daily history in bulk and iterates over a subset of the dataframe (from t0 to tX, where tX is the current row of the iteration). It does this to emulate what would happen during real-time operation of the algo. I know there are ways of improving it by keeping state in memory between frame calculations, but I was wondering whether there is a more pandas-ish implementation that would give an immediate benefit.
Assume that self.Step is something like 0.00001 and self.Precision is 5; they are used for binning the OHLC bar information into discrete steps for the sake of finding the POC. _frame is a subset of rows of the entire dataframe, and _low/_high correspond to it. The following block of code executes on the entire _frame, which could be upwards of ~250 rows, every time a new row is added by the algo (when calculating the yearly timeframe on daily data). I believe it's the iterrows that's causing the major slowdown. The dataframe has columns such as high, low, open, close, volume. I am calculating time price opportunity and volume point of control.
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
    _prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
    # Evenly distribute the bar's volume over its range
    volume_prices[_prices] += state.volume / _prices.size
    # Increment time at price
    time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2
You can use this function as a base and adjust it:
def f(x):  # function to find the POC price and volume
    a = x['tradePrice'].value_counts().index[0]
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
    return pd.Series([a, b], ['POC_Price', 'POC_Volume'])
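A typical way to use it (assuming tick data with the tradePrice/tradeVolume columns the function expects; df_ticks and the 'session' grouping column are hypothetical names) would be:

# 'session' is a hypothetical grouping key; replace it with whatever splits your data
poc = df_ticks.groupby('session').apply(f)   # one POC_Price / POC_Volume row per group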
Here's what I worked out. I'm still not sure the answer your code is producing is correct; I think your line volume_prices[_prices] += state.Volume / _prices.size is not being applied to every record in volume_prices, but here it is with benchmarking. About a 9x improvement.
def vpOriginal():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    time_prices2 = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
        # Evenly distribute the bar's volume over its range
        volume_prices[_prices] += state.Volume / _prices.size
        # Increment time at price
        time_prices[_prices] += 1
        time_prices2 += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    # print(volume_prices.head(10))
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc

def vpNoDF():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around((state.High - state.Low) / Step, 0)
        # Evenly distribute the bar's volume over its range
        volume_prices.loc[state.Low:state.High] += state.Volume / _prices
        # Increment time at price
        time_prices.loc[state.Low:state.High] += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've managed to get it down to 2 minutes instead of 15, on daily timeframes anyway. It's still slow on lower timeframes (10 minutes on hourly data over a 2-year period with a precision of 2 for equities). Working with DataFrames as opposed to Series was FAR slower. I'm hoping for more, but I don't know what I can do aside from the following solution:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
    # Include 1 step above high as np.arange does not
    # include the upper limit by default
    state = frame.iloc[-min(bars + 1, frame.index.size)]
    bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
    return pd.Series(state.volume / bins.size, index=bins)

# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
I'm working with a large flight delay dataset, trying to predict the flight delay from several new features. Based on a plane's tail number, I want to count the number of flights and sum the total airtime the plane has accumulated in the past X (to be specified) hours/days, to create a new "usage" variable.
Example of the data (some columns excluded):
ID tail_num deptimestamp dep_delay distance air_time
2018-11-13-1659_UA2379 N14118 13/11/2018 16:59 -3 2425 334
2018-11-09-180_UA275 N13138 09/11/2018 18:00 -3 2454 326
2018-06-04-1420_9E3289 N304PQ 04/06/2018 14:20 -2 866 119
2018-09-29-1355_WN3583 N8557Q 29/09/2018 13:55 -5 762 108
2018-05-03-815_DL2324 N817DN 03/05/2018 08:15 0 1069 138
2018-01-12-1850_NK347 N635NK 12/01/2018 18:50 100 563 95
2018-09-16-1340_OO4721 N242SY 16/09/2018 13:40 -3 335 61
2018-06-06-1458_DL2935 N351NB 06/06/2018 14:58 1 187 34
2018-06-25-1030_B61 N965JB 25/06/2018 10:30 48 1069 143
2018-12-06-1215_MQ3617 N812AE 06/12/2018 12:15 -9 427 76
Example output for give = 'all' (not based on example data):
2018-12-31-2240_B61443 (1, 152.0, 1076.0, 18.0)
I've written a function, applied to each row, that filters the dataframe for flights with the same tail number within the specified time frame and then returns either the number of flights / total airtime or a dataframe containing the flights in question. It works but takes a long time (around 3 hours when calculating for a subset of 400k flights while filtering against the full dataset of over 7m rows). Is there a way to speed this up?
import datetime
import numpy as np

def flightsbefore(ID,
                  give='number',
                  direction='before',
                  seconds=0,
                  minutes=0,
                  hours=0,
                  days=0,
                  weeks=0,
                  months=0,
                  years=0):
    """Takes the ID of a flight and a time unit and returns the flights of that plane within that timeframe."""
    tail_num = dfallcities.loc[ID, 'tail_num']
    date = dfallcities.loc[ID].deptimestamp
    # dfallcities1 = dfallcities[(dfallcities.a != -1) & (dfallcities.b != -1)]
    if direction == 'before':
        timeframe = dfallcities.loc[ID].deptimestamp - datetime.timedelta(seconds=seconds,
                                                                          minutes=minutes,
                                                                          hours=hours,
                                                                          days=days,
                                                                          weeks=weeks)
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp >= timeframe) &
                             (dfallcities.deptimestamp < date)]
    else:
        timeframe = dfallcities.loc[ID].deptimestamp + datetime.timedelta(seconds=seconds,
                                                                          minutes=minutes,
                                                                          hours=hours,
                                                                          days=days,
                                                                          weeks=weeks)
        output = dfallcities[(dfallcities.tail_num == tail_num) &
                             (dfallcities.deptimestamp <= timeframe) &
                             (dfallcities.deptimestamp >= date)]
    if give == 'number':
        return output.shape[0]
    elif give == 'all':
        if output.empty:
            prev_delay = 0
        else:
            prev_delay = np.max((output['dep_delay'].iloc[-1], 0))
        return (output.shape[0], output['air_time'].sum(), output['distance'].sum(), prev_delay)
    elif give == 'flights':
        return output.sort_values('deptimestamp')
    else:
        raise ValueError("give must be one of [number, all, flights]")
There are no errors; it is simply very slow.
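For comparison, a rough vectorized alternative would precompute rolling counts and sums per tail number with pandas' time-based rolling windows instead of filtering the whole frame once per row. This is only a sketch under assumptions (deptimestamp already parsed as datetime, a fixed 7-day look-back, and windows that include the current flight rather than strictly earlier ones), not a drop-in replacement for flightsbefore:

import pandas as pd

# Assumes dfallcities has a datetime 'deptimestamp' plus 'tail_num' and 'air_time' columns
df = dfallcities.sort_values('deptimestamp').set_index('deptimestamp')
usage = (df.groupby('tail_num')['air_time']
           .rolling('7D')                       # look back 7 days; the current row is included
           .agg(['count', 'sum'])
           .rename(columns={'count': 'n_flights', 'sum': 'total_air_time'}))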
I'm reading data from a file; the content is time - id - data.
When I run it on macOS it works well, but on Linux it sometimes works and sometimes fails.
The error is "IndexError: list index out of range".
The data looks like this:
1554196690 0.0 178 180 180 178 178 178 180
1554196690 0.1 178 180 178 180 180 178 178
1554196690 0.2 175 171 178 173 173 178 172
1554196690 0.3 171 175 175 17b 179 177 17e
1554196691 0.4 0 d3
1554196691 0.50 28:10:4:92:a:0:0:d6 395
1554196691 0.58 28:a2:23:93:a:0:0:99 385
data = []
boardID = 100  # how many lines at most in the data file
for i in range(8):
    data.append([[] for x in range(8)])  # 5 boards, every board has 7 sensors plus 1 board ID
time_stamp = []
time_xlabel = []
time_second = []
for i in range(8):
    time_stamp.append([])   # the 5th line's data is the input voltage and pressure
    time_xlabel.append([])  # for the x label
    time_second.append([])  # time from timestamp to elapsed time (start time is 0)
with open("Potting_20190402-111807.txt", "r") as eboardfile:
    for line in eboardfile:
        values = line.strip().split("\t")
        # define the board: 0-3 are the electron boards, board 4 the pressure sensor,
        # board 5 a temperature sensor located inside the house, not on the eboard
        boardID = int(round(float(values[1]) % 1 * 10))
        time_stamp[boardID].append(int(values[0]))
        if boardID >= 0 and boardID < 4:
            for i in range(2, 9):
                data[boardID][i - 2].append(int(values[i], 16) * 0.0625)
        if boardID == 4:  # pressure
            data[boardID][0].append(int(values[2], 16) * 5. / 1024. * 14.2 / 2.2)  # voltage divider: 12k + 2.2k
            data[boardID][1].append((int(values[3], 16) * 5. / 1024. - 0.5) / 4. * 6.9 * 1000.)  # ADC to volts: value * 5V/1024; volts to hPa: (Vout - 0.5V)/4V * 6.9 bar * 1000
        elif boardID > 4 and boardID < 7:  # temperature sensor located inside the house, not on the electron boards
            data[boardID][0].append(int(values[4], 10) * 0.0625)  # values[2] is the address, [3] is empty, [4] is the value
eboardfile.close()
Traceback (most recent call last):
    boardID = int(round(float(values[1]) % 1 * 10))
IndexError: list index out of range
This error occurs when values has fewer elements than expected, which means the line (values = line.strip().split("\t")) contains no \t at all.
Maybe an empty line, or a Linux line-ending/format problem?
You can check the length of values before using it:
if len(values) < 9:
    continue
Or try splitting on any run of whitespace instead of only on tabs:
values = line.split()
I cannot reproduce your situation, so just give it a try.
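Putting both suggestions together, a defensive version of the read loop might look like this (the minimum field count of 2 is just an illustrative guess; pick whatever matches your shortest valid line):

with open("Potting_20190402-111807.txt", "r") as eboardfile:
    for line in eboardfile:
        values = line.split()   # split on any run of whitespace (tabs or spaces)
        if len(values) < 2:     # skip empty or malformed lines
            continue
        boardID = int(round(float(values[1]) % 1 * 10))
        # ... rest of the per-line processing as in the question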
I have some real rainfall data recorded as a date and time plus the accumulated number of tips of a tipping-bucket rain gauge. Each tip of the bucket represents 0.5 mm of rainfall.
I want to cycle through the file and determine the variation in intensity (rainfall/time).
So I need a rolling average over multiple fixed time frames:
I want to accumulate rainfall until 5 minutes of rain has accumulated and determine the intensity in mm/hour. So if 3 mm is recorded in 5 minutes, that is 3/5*60 = 36 mm/hr.
The same rainfall over 10 minutes would be 18 mm/hr...
So if I have rainfall over several hours, I may need to review it at several standard intervals of, say, 5, 10, 15, 20, 25, 30, 45, 60 minutes, etc.
Also, the data is recorded in reverse order in the raw file, so the earliest time is at the end of the file and the latest time step appears first, right after a header.
It looks like the following (here 975 - 961 = 14 tips = 7 mm of rainfall, an average intensity of 1.4 mm/hr over the whole span).
But between 16:27 and 16:34, 967 - 961 = 6 tips = 3 mm in 7 minutes ≈ 25.7 mm/hour.
7424 Figtree (O'Briens Rd)
DATE :hh:mm Accum Tips
8/11/2011 20:33 975
8/11/2011 20:14 974
8/11/2011 20:04 973
8/11/2011 20:00 972
8/11/2011 19:35 971
8/11/2011 18:29 969
8/11/2011 16:44 968
8/11/2011 16:34 967
8/11/2011 16:33 966
8/11/2011 16:32 965
8/11/2011 16:28 963
8/11/2011 16:27 962
8/11/2011 15:30 961
Any suggestions?
I am not entirely sure what it is that you have a question about.
Do you know how to read out the file? You can do something like:
data = []  # Empty list of counts
# Skip the header
lines = [line.strip() for line in open('data.txt')][2::]
for line in lines:
    print line
    date, hour, count = line.split()
    h, m = hour.split(':')
    t = int(h) * 60 + int(m)       # Compute total minutes
    data.append((t, int(count)))   # Append as tuple
data.reverse()
Since your data is cumulative, you need to subtract each pair of consecutive entries; this is where Python's list comprehensions are really nice.
data = [(t1, d2 - d1) for ((t1,d1), (t2, d2)) in zip(data, data[1:])]
print data
Now we need to loop through and see how many entries are within the last x minutes.
timewindow = 10
for i, (t, count) in enumerate(data):
    # Find the entries that happened within the last [timewindow] minutes
    withinwindow = filter(lambda x: x[0] > t - timewindow, data)
    # now you can print out any kind of stats about these "within window" entries
    print sum(count for (t, count) in withinwindow)
Since the time stamps do not come at regular intervals, you should use interpolation to get the most accurate results. This will also make the rolling average easier. I'm using the Interpolate class from this answer in the code below.
from time import strptime, mktime

totime = lambda x: int(mktime(strptime(x, "%d/%m/%Y %H:%M")))

with open("my_file.txt", "r") as myfile:
    # Skip header
    for line in myfile:
        if line.startswith("DATE"):
            break
    times = []
    values = []
    for line in myfile:
        date, time, value = line.split()
        times.append(totime(" ".join((date, time))))
        values.append(int(value))
times.reverse()
values.reverse()
i = Interpolate(times, values)
Now it's just a matter of choosing your intervals and computing the difference between the endpoints of each interval. Let's create a generator function for that:
def rolling_avg(cumulative_lookup, start, stop, step_size, window_size):
    for t in range(start + window_size, stop, step_size):
        total = cumulative_lookup[t] - cumulative_lookup[t - window_size]
        yield total / window_size
Below I'm printing the number of tips per hour over the previous hour, at 10-minute intervals:
start = totime("8/11/2011 15:30")
stop = totime("8/11/2011 20:33")
for avg in rolling_avg(i, start, stop, 600, 3600):
    print avg * 3600
EDIT: Made totime return an int and created the rolling_avg generator.
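The Interpolate class referenced above comes from a linked answer that is not reproduced here. As an assumption about its interface (indexable with a time value, returning a linearly interpolated reading), a minimal stand-in could be:

from bisect import bisect_right

class Interpolate:
    """Minimal linear-interpolation lookup: i[t] returns the value interpolated at time t."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys
    def __getitem__(self, x):
        if x <= self.xs[0]:
            return self.ys[0]
        if x >= self.xs[-1]:
            return self.ys[-1]
        j = bisect_right(self.xs, x)
        x0, x1 = self.xs[j - 1], self.xs[j]
        y0, y1 = self.ys[j - 1], self.ys[j]
        return y0 + (y1 - y0) * (x - x0) / float(x1 - x0)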