I have a pandas DataFrame which records the accumulated network traffic (bytes) from several programs at certain (but not constant) intervals. It is like the "all time download / upload" counters some programs show. The DataFrame is constantly being updated: some columns are deleted and some are added. The index is a pandas.DatetimeIndex.
Looks like this:
                            Program_A  Program_B  Program_C
2020-10-21 19:30:01.352301        100        200        NaN
2020-10-21 19:45:01.245997        200        250        NaN
...                               ...        ...        ...
2020-10-22 17:30:01.123456      10000        700        NaN
2020-10-22 17:45:01.158689      30000        700        NaN
2020-10-22 18:00:01.191560      50000        700        NaN
2020-10-22 18:15:01.208001      70000        700        NaN
2020-10-22 18:20:28.401580     100000        700       5000
2020-10-22 18:30:01.281731     110000        700     200000
Every time I retrieve the DataFrame, I want to calculate the average traffic speed (byte/sec) for every column in the last hour. I want something like this:
last_hour_avgspeed(myDataFrame)
-->
Program_A 27.7
Program_B 0.0
Program_C 325.0
......
dtype: float64
There could be NaNs in the data because some columns are added within an hour. So a simple (last row - first row) / 3600 would not work.
I'm new to pandas. I first wrote a function:
import numpy as np
import pandas as pd

def avgspeed(series: pd.Series):
    lo = series.first_valid_index()
    hi = series.last_valid_index()
    s = series[hi] - series[lo]       # bytes transferred between the two timestamps
    t = (hi - lo).total_seconds()     # elapsed time in seconds
    return s // t if t > 0 else np.nan
Then apply this to every column:
myDataFrame.last('H').apply(avgspeed)
I believe this does give the correct result: a pandas Series of column-speed pairs. However, I feel this can't be the best way. Where is the vectorization? Can we get the result in one hit?
I have tried another method:
myDataFrame.last("H").resample("T").bfill().diff().mean().floordiv(60)
First resample the data to 1-minute samples (not 1 second, because that is too slow), then take the mean of the differences and divide it by 60 seconds... I think this is even sillier than the first method, but it is actually about twice as fast. However, for columns containing NaN the result differs somewhat from the first method; I suspect the bfill step introduces some problems.
So, what is the correct way to do the calculation?
I couldn't understand it well, but I think you need to use groupby and aggregate, something like:
df.groupby(grouping_key).agg(new_column=('column', 'mean'))
You can read more info here: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
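For reference, a minimal sketch of that groupby/agg pattern on a DatetimeIndex (a generic illustration only, not tailored to the speed calculation above; bucketing by hour is an assumption):
# bucket rows by the hour they fall in and take the mean of every column
hourly_mean = myDataFrame.groupby(myDataFrame.index.floor('H')).agg('mean')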
Well, I think I figured out a vectorized solution:
def get_avgspeed(df: pd.DataFrame, offset: str) -> pd.Series:
    """Calculate the average traffic speed over the final period given by offset."""
    # truncate the dataframe to the last N time units
    df = df.truncate(before=(pd.Timestamp.now() - pd.Timedelta(offset)), copy=False)
    # elapsed time per column: last index minus each column's first valid index
    t = df.index[-1] - df.apply(pd.Series.first_valid_index)
    # df.bfill() back-fills NaNs, so iloc[0] gives the first valid value per column
    # value difference per column: df.iloc[-1] - df.bfill().iloc[0]
    # speed: v = s / t
    return (df.iloc[-1] - df.bfill().iloc[0]) // t.dt.total_seconds()
get_avgspeed(myDataFrame, '24H')
-->
Program_A 247.0
Program_B 16.0
Program_C 197620.0
Program_X 252943.0
... ...
dtype: float64
In my test, the performance seems to be a little better than the non-vectorized version in the question: 1.77 ms vs 2.1 ms.
Related
I am trying to add together the elements in the second column from two dataframes where the time (in the first column) is the same; however, the time in each DataFrame is spaced at different intervals. So, in the plot below, I would like to add the y values of both lines together:
[plot: power vs. time for both dataframes, with an overlapping middle section]
So where they overlap, the combined value would be at around 3200.
Each dataframe has two columns: the first is time as a unix timestamp and the second is power in watts. The spacing between rows is usually 6 seconds, but sometimes more or less. Also, each dataframe starts and ends at a different time, although there is some overlap in the middle.
I've added the first few rows for ease of viewing:
df1:
time power
0 1355526770 1500
1 1355526776 1800
2 1355526782 1600
3 1355526788 1700
4 1355526794 1400
df2:
time power
0 1355526771 1250
1 1355526777 1200
2 1355526783 1280
3 1355526789 1290
4 1355526795 1300
I first thought to reindex each dataframe, inserting a row for every second across its time range, and then linearly interpolate the power values between timestamps. Then I would add the dataframes together by summing the power values where the timestamps match exactly.
The problem with this method is that it would increase the size of each dataframe by at least 6x, and since they're already pretty big, this would slow things down a lot.
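For reference, a lighter-weight sketch of the same idea that only reindexes onto the union of the two sets of timestamps rather than onto every second (df1 and df2 as shown above; the interpolation is linear in the time values):
# index each power series by its unix timestamp
s1 = df1.set_index('time')['power']
s2 = df2.set_index('time')['power']

# put both series on the union of the two time indexes and interpolate linearly in time
full_index = s1.index.union(s2.index)
combined = (s1.reindex(full_index).interpolate(method='index')
            + s2.reindex(full_index).interpolate(method='index'))
# where only one series has data (the non-overlapping ends), the sum is NaN or based on
# an edge value, so those rows may need to be dropped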
If anyone knows another method to do this I would be very grateful.
Beyond what the other users have said, you could also consider trying out Modin instead of pure pandas for your datasets if you want another way to speed up computation and so forth. Modin is easily integrated with your system with just one line of code. Take a look here: Intel® Distribution of Modin
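If you want to try it, the usual way to switch is a one-line change of the import; the rest of the code stays the same (a sketch, assuming the operations you use are covered by Modin; the file names are hypothetical):
import modin.pandas as pd   # drop-in replacement for: import pandas as pd

df1 = pd.read_csv('power_log_1.csv')   # hypothetical file names, just to show usage
df2 = pd.read_csv('power_log_2.csv')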
Using a merge_asof to align on the nearest time:
(pd.merge_asof(df1, df2, on='time', direction='nearest', suffixes=(None, '_2'))
.assign(power=lambda d: d['power'].add(d.pop('power_2')))
)
Output:
time power
0 1355526770 2750
1 1355526776 3000
2 1355526782 2880
3 1355526788 2990
4 1355526794 2700
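If the two logs can drift further apart than a few seconds, merge_asof also accepts a tolerance, so that rows without a sufficiently close partner stay NaN instead of being paired with a distant one (a sketch; the 10-second threshold is an assumption):
# only pair rows whose timestamps are within 10 seconds of each other
merged = pd.merge_asof(df1, df2, on='time', direction='nearest',
                       tolerance=10, suffixes=(None, '_2'))
merged['power'] = merged['power'] + merged.pop('power_2')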
I start with a large list of all Bitcoin prices and import it into a DataFrame.
df.head()
BTC-USDT_close
open_time
2021-11-05 22:28:00 61151.781250
2021-11-05 22:27:00 61199.011719
2021-11-05 22:26:00 61201.398438
2021-11-05 22:25:00 61237.828125
2021-11-05 22:24:00 61195.578125
...
221651 rows total.
What I need is the following:
For each row in this dataframe:
take the next 60 values
take the next 60 values, one every 5 rows
take the next 60 values, one every 15 rows
take the next 60 values, one every 60 rows
take the next 60 values, one every 360 rows
take the next 60 values, one every 5760 rows
add this new table of 60 rows as an array to a list
So in the end I want to have a lot of these:
small_df.head(6)
BTC-USDT_1m BTC-USDT_5m BTC-USDT_15m BTC-USDT_1h BTC-USDT_6h BTC-USDT_4d
0 61199.011719 61199.011719 61199.011719 61199.011719 61199.011719 61199.011719
1 61201.398438 61241.390625 61159.578125 61079.800781 60922.968750 60968.320312
2 61237.828125 61309.000000 61063.628906 60845.710938 61682.960938 60717.500000
3 61195.578125 61159.578125 61100.000000 61060.000000 62191.000000 60939.210938
4 61221.179688 61165.371094 61079.800781 61220.011719 61282.000000 65934.328125
5 61241.390625 61047.488281 61175.238281 60812.210938 61190.300781 60599.000000
...
60 rows total
(Basically these are the sequences of 60 previous values in different time frames)
So the code is as follows:
seq_len = 60  # each small_df holds 60 rows, as shown above

seq_list = []
for i in range(len(df) // 2):
    r = i + 1
    small_df = pd.DataFrame()
    small_df['BTC-USDT_1m'] = df['BTC-USDT_close'][r:r+seq_len:1].reset_index(drop=True)
    small_df['BTC-USDT_5m'] = df['BTC-USDT_close'][r:(r+seq_len)*5:5].dropna().reset_index(drop=True)
    small_df['BTC-USDT_15m'] = df['BTC-USDT_close'][r:(r+seq_len)*15:15].dropna().reset_index(drop=True)
    small_df['BTC-USDT_1h'] = df['BTC-USDT_close'][r:(r+seq_len)*60:60].dropna().reset_index(drop=True)
    small_df['BTC-USDT_6h'] = df['BTC-USDT_close'][r:(r+seq_len)*360:360].dropna().reset_index(drop=True)
    small_df['BTC-USDT_4d'] = df['BTC-USDT_close'][r:(r+seq_len)*5760:5760].dropna().reset_index(drop=True)
    seq_list.append([small_df, df['target'][r]])
As you can imagine, it's very slow: it can do about 1,500 sequences per minute, so the whole process is going to take about 12 hours.
Could you please show me a way to speed things up?
Thanks in advance!
You wouldn't do this by indexing, as this creates large intermediate frames and is inefficient. Instead, you would use .rolling() to create rolling windows.
You can see in the documentation that rolling also supports windows over timestamps. Here is the example from the docs:
>>> df_time.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In your case, you could do the following
small_df['BTC-USDT_1m'] = df['BTC-USDT_close'].rolling("1m").mean().reset_index(drop=True)
The first argument is always the size of the window, i.e. the number of samples to take from df. This can be an integer for an exact number of samples, or a time offset string to move through the table based on a fixed timeframe.
In this case it would compute the mean price based on a moving window of 1 minute.
This would be far more accurate than your index-based solution, since there you do not take into account the distance between the timestamps, and you are also just taking single values, which makes you highly dependent on local fluctuations. A mean over a given window size gives you the average price over that timespan.
However, if you want just the single price at a given stride, then you simply use a window of size 1 together with a step (available since pandas 1.5), like:
# any aggregation over a size-1 window just returns that value
small_df['BTC-USDT_1h'] = df['BTC-USDT_close'].rolling(1, step=60).max().reset_index(drop=True)
The step argument here makes the moving window not consider every single element but rather move 60 samples each time a value is taken.
Any solution like yours, or the latter with step, of course produces a different number of samples than the original, so you would have to think about whether you want to drop NaN values, fill in gaps, use expanding windows, and so on.
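To make that concrete, here is a rough sketch of building one smoothed column per timeframe with time-based rolling means (column names taken from the question; the index must be a sorted DatetimeIndex for time-based windows, hence the sort_index, and the choice of mean() as the aggregation is an assumption):
import pandas as pd

close = df['BTC-USDT_close'].sort_index()   # time-based windows need a monotonic index

smoothed = pd.DataFrame({
    'BTC-USDT_1m':  close.rolling('1min').mean(),
    'BTC-USDT_5m':  close.rolling('5min').mean(),
    'BTC-USDT_15m': close.rolling('15min').mean(),
    'BTC-USDT_1h':  close.rolling('1h').mean(),
    'BTC-USDT_6h':  close.rolling('6h').mean(),
    'BTC-USDT_4d':  close.rolling('4d').mean(),
})
Note that these columns stay at 1-minute spacing; it is the step argument shown above that thins a column down to one value per timeframe.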
I have a DataFrame which has a column for minutes and a correlated value. The frequency is about 79 seconds, but sometimes there is missing data for a period (no rows at all). I want to detect whether there is a gap of 25 or more minutes and delete the dataset if so.
How do I test whether such a gap exists?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, especially to pandas, so an explanation would be helpful for learning.
You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the new column from the original one.
We also drop the first row beforehand, since we don't want the difference between your first timepoint in the original column and your last timepoint, which got rolled to the start of the new column.
Then we just ask if any of the values resulting from the subtraction is above your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True
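For comparison, pandas' built-in diff() expresses the same check without the helper column (a sketch, assuming the same df as above):
# difference between consecutive rows; the first row becomes NaN, which compares as False
has_gap = (df['minutes'].diff() > 25).any()

if has_gap:
    df = df.iloc[0:0]   # empty the dataframe but keep the columns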
I'm using pandas to create data frames which will then be imported into PowerBI for visualization. One of the columns in the data frame is a percentage calculation.
I have no issues calculating the values. However, these values appear without the '%' sign at the end, e.g. 55.2 as opposed to 55.2%.
An example of my initial dataframe:
df1 =
year_per pass fail total
---------------------------------
201901 300 700 1000
201902 400 600 1000
201903 200 800 1000
201904 500 500 1000
I then calculate two new columns to state the % of the total that each column represent, such that the new data frame is:
df2 =
year_per pass fail total pass% fail%
---------------------------------------------------
201901 300 700 1000 30.0 70.0
201902 400 600 1000 40.0 60.0
201903 200 800 1000 20.0 80.0
201904 500 500 1000 50.0 50.0
These new % columns are created using the following code:
df2['pass%'] = round((df1['pass'] / df1['total']) * 100,1)
Which works. PowerBI is happy to use those values. However, I'd like it to display the '%' sign at the end for clarity. Therefore, I updated the calculation code to:
df2['pass%'] = (round((df1['pass'] / df1['total']) * 100,1).astype(str))+'%'
This also produces the right output, visually. However, as the values are now strings, PowerBI can't process the new values as the visualization is expecting a number format, not a string.
I've also tried using the following formatting (as mentioned here: how to show Percentage in python):
"{0:.1f}%".format()
i.e.:
df2['pass%'] = '{0:.1f}%'.format(round((df1['pass'] / df1['total']) * 100,1))
but get the error:
'TypeError: unsupported format string passed to Series.__format__'
Therefore, I was wondering if there is a way to store the values as a number format with the % sign following the numbers? Otherwise I'll just have to live with the values without the % sign.
This happens because you pass a whole Series to the format string: str.format expects a scalar numeric value but gets a Series, hence the unsupported format string passed to Series.__format__ error. You can apply the formatting element-wise with map instead:
df2['pass%'] = (df1['pass'] / df1['total']).map(lambda num: '{0:.1f}%'.format(round(num * 100, 1)))
But you know, in contrast to the title of your question, this would of course store the percentage as a string.
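If the percentage is only needed for display on the pandas side, another option is to keep the column numeric and attach the % sign through the Styler, which affects rendering only, not the stored values (a sketch; whether PowerBI picks up Styler formatting is a separate question):
# keep pass%/fail% as plain floats; only the displayed representation gets the sign
styled = df2.style.format({'pass%': '{:.1f}%', 'fail%': '{:.1f}%'})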
I'm relatively new to pandas (and Python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that runs in a reasonable amount of time.
The data is stored in a data frame called "YTDSales" which has sales per day, per product
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
What I want to do is simulate different scenarios, using for the rest of the year (from October 15 to year end) the historical distribution that each product had. For example, with the data presented I would like to fill the rest of the year with sales between 20 and 1100.
What I've done is the following
import datetime as dt
import numpy as np
import pandas as pd

# create the range of "future" dates
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014, 12, 30)
DatesEOY = pd.date_range(start=last_historical, end=year_end).shift(1)

# function that obtains a random sales number per product, between its min and max
f = lambda x: np.random.randint(x.min(), x.max())

# create all the "future" dates and fill each one with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
The solution works, but it takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way to avoid iterating?
Thanks
Use the size option for np.random.randint to get a sample of the needed size all at once.
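A minimal sketch of what that looks like for a single product column (names taken from the question; the point is drawing the whole remaining period in one call):
import numpy as np

# draw every remaining day for one product at once instead of one .loc[] per day
n_days = len(DatesEOY)
samples = np.random.randint(YTDSales['Product_A'].min(),
                            YTDSales['Product_A'].max(),
                            size=n_days)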
Briefly, one approach that I would consider is as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
new_df = pandas.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)

# one batch of random draws per product column (x is a (name, series) pair from items())
f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)

output = pandas.concat([YTDSales, new_df], axis=0)
output.iloc[len(YTDSales):] = np.asarray(list(map(f, YTDSales.items()))).T
Along the way, I choose to make a totally new DataFrame, by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another way to approach is setting with enlargement as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can just "enlarge" the original data frame with all NaN values (at the index values from DatesEOY), and then apply the function above to YTDSales instead of bringing output into it at all.
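A hedged sketch of that "enlarge, then overwrite in one batch" idea, assuming YTDSales and DatesEOY as defined in the question:
import numpy as np

# enlarge the frame with empty rows for the future dates
extended = YTDSales.reindex(YTDSales.index.union(DatesEOY))

# fill each product column with a single batch of random draws
# (the only remaining loop is over columns, not over dates)
for col in extended.columns:
    lo, hi = YTDSales[col].min(), YTDSales[col].max()
    extended.loc[DatesEOY, col] = np.random.randint(lo, hi, size=len(DatesEOY))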