random sampling with pandas dataframe - python

I'm relatively new to pandas (and Python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that runs in a reasonable amount of time.
The data is stored in a data frame called "YTDSales", which has sales per day, per product:
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
What I want to do is simulate different scenarios, filling in the rest of the year (from October 15 to year end) using the historical distribution that each product had. For example, with the data presented I would like to fill the rest of the year with sales between 20 and 1100.
What I've done is the following:
import datetime as dt
import numpy as np
import pandas as pd

# create the range of "future" dates
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014, 12, 30)
DatesEOY = pd.date_range(start=last_historical, end=year_end).shift(1)

# function that draws a random sales number per product, between its min and max
f = lambda x: np.random.randint(x.min(), x.max())

# create all the "future" dates and fill them with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
The solution works, but it takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way to avoid iterating?
Thanks

Use the size option for np.random.randint to get a sample of the needed size all at once.
Briefly, one approach I would consider is as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
new_df = pd.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)

# x is a (column_name, column_values) pair from items(), so x[1] is the Series
f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)

output = pd.concat([YTDSales, new_df], axis=0)
output[len(YTDSales):] = np.asarray([f(item) for item in YTDSales.items()]).T
Along the way, I chose to make a totally new DataFrame by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another way to approach this is setting with enlargement, as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can just "enlarge" the original data frame with all NaN values (at the index values from DatesEOY) and then apply the function above to YTDSales, instead of bringing output into it at all; a rough sketch of that idea follows.
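For what it's worth, a minimal sketch of that enlargement route (assuming DatesEOY and YTDSales are defined as above): reindex grows the frame in one step, and the only loop left is over columns rather than dates.
import numpy as np

# enlarge the frame in one step: NaN rows appear at every new date
YTDSales = YTDSales.reindex(YTDSales.index.union(DatesEOY))

# batch-sample per column with the size keyword, overwriting only the new rows
n_new = len(DatesEOY)
for col in YTDSales.columns:
    hist = YTDSales[col].iloc[:-n_new]  # historical values only
    YTDSales.loc[DatesEOY, col] = np.random.randint(int(hist.min()), int(hist.max()), n_new)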

Add together elements from Pandas DataFrame based on timestamp

I am trying to add together elements in the second column from two dataframes where the time (in the first column) is the same; however, the time in each DataFrame is spaced at different intervals. So, in the image below, I would like to add the y values of both lines together:
[figure: the two power-vs-time series plotted on the same axes]
So where they overlap, the combined value would be at around 3200.
Each dataframe has two columns: the first is time as a unix timestamp, and the second is power in watts; the spacing between rows is usually 6 seconds, but sometimes more or less. Also, each dataframe starts and ends at a different time, although there is some overlap in the inner portion.
I've added the first few rows for ease of viewing:
df1:
time power
0 1355526770 1500
1 1355526776 1800
2 1355526782 1600
3 1355526788 1700
4 1355526794 1400
df2:
time power
0 1355526771 1250
1 1355526777 1200
2 1355526783 1280
3 1355526789 1290
4 1355526795 1300
I first thought to reindex each dataframe, inserting a row for every second across its time range, and then linearly interpolating the power values between each time. Then I would add the dataframes together by adding the power values where the timestamps matched exactly.
The problem with this method is that it would increase the size of each dataframe by at least 6x, and since they're already pretty big, this would slow things down a lot.
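For concreteness, the reindex-and-interpolate idea I'm describing would look roughly like this (an untested sketch using the sample columns above):
import pandas as pd

def per_second_power(df):
    s = df.set_index('time')['power']
    # insert a row for every second across the time range, then interpolate linearly
    every_second = range(int(s.index.min()), int(s.index.max()) + 1)
    return s.reindex(every_second).interpolate(method='linear')

# add the two series; where only one has data, its value is kept as-is
combined = per_second_power(df1).add(per_second_power(df2), fill_value=0)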
If anyone knows another method to do this I would be very grateful.
Beyond what the other users have said, you could also consider trying out Modin instead of plain pandas if you want another way to speed up computation. Modin is easily integrated with your setup with just one line of code. Take a look here: Intel® Distribution of Modin
Using merge_asof to align on the nearest time:
# align each row of df1 with the nearest-in-time row of df2, then sum the two power columns
(pd.merge_asof(df1, df2, on='time', direction='nearest', suffixes=(None, '_2'))
   .assign(power=lambda d: d['power'].add(d.pop('power_2')))
)
Output:
time power
0 1355526770 2750
1 1355526776 3000
2 1355526782 2880
3 1355526788 2990
4 1355526794 2700
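If the non-overlapping ends of the two frames are a concern, an integer tolerance (10 seconds here is just an assumption) keeps merge_asof from pairing rows that are far apart; rows with no nearby match come back as NaN:
merged = pd.merge_asof(df1, df2, on='time', direction='nearest',
                       suffixes=(None, '_2'), tolerance=10)
merged['power'] = merged['power'] + merged.pop('power_2')  # NaN where df2 had no row within 10 s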

Pandas get the average last hour speed from accumulated network data

I have a pandas DataFrame which records the accumulated network traffic (bytes) from several programs at certain (but not constant) intervals. It is like the "all time download / upload" figure in some programs. The DataFrame is constantly being refreshed; some columns are deleted and some are added. The index is a pandas.DatetimeIndex.
Looks like this:
Program_A Program_B Program_C
2020-10-21 19:30:01.352301 100 200 NaN
2020-10-21 19:45:01.245997 200 250 NaN
...
2020-10-22 17:30:01.123456 10000 700 NaN
2020-10-22 17:45:01.158689 30000 700 NaN
2020-10-22 18:00:01.191560 50000 700 NaN
2020-10-22 18:15:01.208001 70000 700 NaN
2020-10-22 18:20:28.401580 100000 700 5000
2020-10-22 18:30:01.281731 110000 700 200000
Every time I retrieve the DataFrame, I want to calculate the average traffic speed (byte/sec) for every column in the last hour. I want something like this:
last_hour_avgspeed(myDataFrame)
-->
Program_A 27.7
Program_B 0.0
Program_C 325.0
......
dtype: float64
There could be NaNs in the data because some columns are added within an hour. So a simple (last row - first row) / 3600 would not work.
I'm new to pandas. I first wrote a function:
def avgspeed(series: pd.Series):
    lo = series.first_valid_index()
    hi = series.last_valid_index()
    s = series[hi] - series[lo]
    t = (hi - lo).total_seconds()
    return s // t if t > 0 else np.nan
Then apply this to every column:
myDataFrame.last('H').apply(avgspeed)
I believe this does give the correct result: a pandas Series of column-speed pairs. However, I feel this can't be the best way. Where is the vectorization? Can we get the result in one hit?
I have tried another method:
myDataFrame.last("H").resample("T").bfill().diff().mean().floordiv(60)
First resample the data to 1-minute samples (not 1 s, because that would be too slow), then calculate the mean of the differences, then divide by 60 seconds... I think this is even sillier than the first method, but the performance is actually two times faster. However, the result for columns containing NaN is somewhat different from the first method's. It could be because the bfill method introduced some problems, I think.
So, what is the correct way to do the calculation?
I couldn't understand well, but I think you need to use groupby and aggregate:
df.groupby('key_column').agg(result=('value_column', 'mean'))
You can read more here: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
Well, I think I figured out a vectorized solution:
def get_avgspeed(df: pd.DataFrame, offset: str) -> pd.Series:
    """Calculate average traffic speed in the final period of time based on offset."""
    # truncate the dataframe to the last N time units
    df = df.truncate(before=(pd.Timestamp.now() - pd.Timedelta(offset)), copy=False)
    # time elapsed between the last index and the first valid index of each column
    t = df.index[-1] - df.apply(pd.Series.first_valid_index)
    # df.bfill() back-fills NaNs, so iloc[0] gives the first valid value per column;
    # the traffic difference is then df.iloc[-1] - df.bfill().iloc[0]
    # return the speed: v = s / t
    return (df.iloc[-1] - df.bfill().iloc[0]) // t.dt.total_seconds()

get_avgspeed(myDataFrame, '24H')
-->
Program_A 247.0
Program_B 16.0
Program_C 197620.0
Program_X 252943.0
... ...
dtype: float64
In my test, the performance seems to be a little better than the non-vectorized version in the question: 1.77 ms vs 2.1 ms.

How to calculate moving average incrementally with daily data added to data frame in pandas?

I have daily data and want to calculate 5-day, 30-day and 90-day moving averages per user and write them out to a CSV. New data comes in every day. How do I calculate these averages for the new data only, assuming I will load the data frame with the last 89 days of data plus today's data?
date user daily_sales 5_days_MA 30_days_MA 90_days_MA
2019-05-01 1 34
2019-05-01 2 20
....
2019-07-18 .....
The number of rows per day is about 1 million. If data for 90 days is too much, 30 days is OK.
You can apply the rolling() method to your dataset if it's in DataFrame format.
your_df['MA_30_days'] = df[where_to_apply].rolling(window = 30).mean()
If you need a different window for the moving average, just change the window parameter. In my example I used mean() for the calculation, but you can choose some other statistic as well.
This code will create another column named 'MA_30_days' with the calculated moving average in your DataFrame.
You can also create another DataFrame where you collect all the moving averages (looping over your dataset to calculate each one) and save it to CSV format as you wanted.
your_df.to_csv('filename.csv')
In your case the calculation should consider only the newest data. If you want to perform this on the latest data, just slice it. However, the very first rows will be NaN (depending on the window).
df[where_to_apply][-90:].rolling(window = 30).mean()
This will calculate the moving average on the last 90 rows of a specific column in some df, and the first 29 rows will be NaN. If your latest 90 rows should all be meaningful data, you can start the calculation earlier than the last 90 rows, depending on the window size.
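Since the sample output has the averages per user, a rough sketch of that (column names are assumed from the question) combines groupby with rolling:
import pandas as pd

# assumed column names from the question: 'date', 'user', 'daily_sales'
df = df.sort_values(['user', 'date'])
for window in (5, 30, 90):
    df[f'{window}_days_MA'] = (
        df.groupby('user')['daily_sales']
          .transform(lambda s: s.rolling(window=window).mean())
    )
df.to_csv('moving_averages.csv', index=False)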
If the df already contains yesterday's moving average, and just the new day's simple MA is required, I would say use this approach:
MAlength = 90
df.loc[day, 'MA'] = (
    (df.loc[day - 1, 'MA'] * MAlength)   # expand yesterday's MA value
    - df.loc[day - MAlength, 'Price']    # remove the oldest price in yesterday's window
    + df.loc[day, 'Price']               # add the newest price
) / MAlength                             # re-average
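A small self-contained check of that incremental update against pandas' own rolling mean (synthetic data, hypothetical column names):
import numpy as np
import pandas as pd

MAlength = 5
df = pd.DataFrame({'Price': np.random.rand(20)})
df['MA'] = df['Price'].rolling(MAlength).mean()

day = len(df) - 1  # "today", with yesterday's MA already filled in
incremental = (df.loc[day - 1, 'MA'] * MAlength
               - df.loc[day - MAlength, 'Price']
               + df.loc[day, 'Price']) / MAlength

assert np.isclose(incremental, df.loc[day, 'MA'])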

Pandas infrastructure data statistics plot with date per user

I am trying to display some daily infrastructure usage statistics with Pandas, but I'm a beginner and can't figure it out after many hours of research.
Here's my data types per column:
Name              object
UserService       object
ItemSize          int64
ItemsCount        int64
ExtractionDate    datetime64[ns]
Each day I have a new extraction for each user, so I probably need to use a groupby before plotting.
Data sample:
Name UserService ItemSize ItemsCount ExtractionDate
1 xyzf_s xyfz 40 1 2018-12-12
2 xyzf1 xyzf 53 5 2018-12-12
3 xyzf2 xyzf 71 4 2018-12-12
4 xyzf3 xyzf 91 3 2018-12-12
14 vo12 vo 41 5 2018-12-12
One of the graphs I am trying to display is as follows:
x axis should be the extraction date
y axis should be the items count (it's divided by 1000 so it's by thousands of items from 1 to 100)
Each line on the graph should represent a user's evolution (to look for data spikes); I guess I would have to display only the top 10 or 50, because a graph of 1500 users would be hard to read.
I'm also interested in any other way you would exploit this data to look for increases and anomalies in data consumption.
Assuming the user is given in the Name column and there is only one line per user per day, you can use the following code to get the plot you are explicitly asking for:
import matplotlib.pyplot as plt

# Limit to 10 users
users_to_plot = df.Name.unique()[:10]

for u in users_to_plot:
    mask = (df['Name'] == u)
    values = df[mask]
    plt.plot('ExtractionDate', 'ItemsCount', data=values.sort_values('ExtractionDate'))
It's important to look at the data and think about what information you are trying to extract and what that looks like. It's probably worth exploring with a few individual users first to get an idea of what you are trying to identify. Think about what makes it unique and whether you can make it stand out on a graph.
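As a rough sketch of that (column names taken from the sample data; the top-10 cutoff is just an assumption), a pivot makes the top-N selection and the per-user lines straightforward:
import matplotlib.pyplot as plt

# one column per user, ExtractionDate on the x axis, thousands of items on the y axis
pivot = df.pivot_table(index='ExtractionDate', columns='Name', values='ItemsCount') / 1000

# keep only the 10 users with the highest total item count
top_users = pivot.sum().nlargest(10).index
pivot[top_users].plot()
plt.ylabel('ItemsCount (thousands)')
plt.show()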

Pandas - get first n-rows based on percentage

I have a dataframe and I want to pop a certain number of records; instead of a number, I want to pass a percentage value.
for example,
df.head(n=10)
pops out the first 10 records from the data set. I want a small change: instead of 10 records, I want to pop the first 5% of records from my data set.
How do I do this in pandas?
I'm looking for code like this:
df.head(frac=0.05)
Is there any simple way to get this?
I want to pop first 5% of record
There is no built-in method but you can do this:
You can multiply the total number of rows by your percentage and use the result as the parameter for the head method.
n = 5
df.head(int(len(df)*(n/100)))
So if your dataframe contains 1000 rows and n = 5% you will get the first 50 rows.
I've extended Mihai's answer for my usage and it may be useful to people out there.
The purpose is automated top-n records selection for time series sampling, so you're sure you're taking old records for training and recent records for testing.
# having
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
test = df.iloc[len(train):]  # everything after the training rows
I also had the same problem and #mihai's solution was useful. For my case I re-wrote it to:
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
I presume df.tail(rows) would work for the last percentage of rows as well; note that df.head(-rows) returns everything except the last rows, rather than the last rows themselves.
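A quick sanity check of that on a throwaway frame (just to see which rows each call returns):
import pandas as pd

df = pd.DataFrame({'a': range(100)})
rows = int(df.shape[0] * 5 / 100)

print(df.tail(rows).equals(df.iloc[-rows:]))    # True: tail(rows) is the last 5%
print(df.head(-rows).equals(df.iloc[:-rows]))   # True: head(-rows) drops the last 5%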
Maybe this will help:
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
df=pd.DataFrame(np.random.randn(10,2))
print(df)
0 1
0 0.375727 -1.297127
1 -0.676528 0.301175
2 -2.236334 0.154765
3 -0.127439 0.415495
4 1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931 2.089305
7 0.075599 0.404521
8 1.836577 -0.762597
9 0.294883 0.540444
#70% of the Dataframe
part_70=df.sample(frac=0.7,random_state=10)
print(part_70)
0 1
8 1.836577 -0.762597
2 -2.236334 0.154765
5 -0.884309 -0.108502
6 -0.884931 2.089305
3 -0.127439 0.415495
1 -0.676528 0.301175
0 0.375727 -1.297127
