I am trying to compute the rolling mean over the last n days(with n = 30) on a large dataset.
In Pandas, I'd use the following command:
temp = chunk.groupby('id_code').apply(lambda x: x.set_index('entry_time_flat').resample('1D').first())
dd = temp.groupby(level=0)['duration'
].apply(lambda x: x.shift().rolling(min_periods = 1,window = n_days).mean()
).reset_index(name = "avg_delay_"+ str(n_days) + "_days")
chunk = pd.merge(chunk, dd, on=['entry_time_flat', 'id_code'], how='left'
).dropna(subset = ["avg_delay_"+ str(n_days) + "_days"])
Basically, the function groups by "id code" and, for the last n-days over "entry_time_flat" (a datetime object), computes the mean value of feature "duration".
However, in order to keep the code efficient, it would be great to reproduce this function on a Dask dataframe, without transforming it into a Pandas DF.
If I run the aforementioned code on a Dask DF, it raises the following error:
TypeError: __init__() got an unexpected keyword argument 'level'
Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?
Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?
The rolling API should give you this functionality
https://docs.dask.org/en/latest/dataframe-api.html#rolling
Related
I'm trying to pass every column of a dataframe through a custom function by using the apply(lamdba x: function in python.
The custom function I have created works individually but when put it into the apply(lamdba x: structure only returns NaN values into the selected dataframe.
first is the custom function -
def snr_pd(wavenumber_arr):
intensity_arr = Zhangfit_output
signal_low = 1650
signal_high = 1750
noise_low = 1750
noise_high = 1850
signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr <
signal_high))
noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
signal = np.max(intensity_arr[signal_mask])
noise = np.std(intensity_arr[noise_mask])
return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis =0,)
Currently I believe this is taking the columns form df, passing them to the snr_pd() and appending them to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple structure changes like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However this return this error instead :
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much apricated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation threshold is above 0.99. I CAN'T use Pandas' correlation function as my dataset is too large, and it eats up my memory in a hurry. What I have now is a slow, double for loop that starts with the first column, and finds the correlation threshold between it and all the other columns one by one, and if it's above 0.99, drop that 2nd comparative column, then starts at the new second column, and so on and so forth, KIND OF like the solution found here, however this is unbearably slow doing this in an iterative form across all columns, although it's actually possible to run it and not run into memory issues.
I've read the API here, and see how to drop columns using Dask here, but need some assistance in getting this figured out. I'm wondering if there's a faster, yet memory friendly, way of dropping highly correlated columns in a Pandas Dataframe using Dask? I'd like to feed in a Pandas dataframe to the function, and have it return a Pandas dataframe after the correlation dropping is done.
Anyone have any resources I can check out, or have an example of how to do this?
Thanks!
UPDATE
As requested, here is my current correlation dropping routine as described above:
print("Checking correlations of all columns...")
cols_to_drop_from_high_corr = []
corr_threshold = 0.99
for j in df.iloc[:,1:]: # Skip column 0
try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error
for k in df.iloc[:,1:]: # Start 2nd loop at first column also...
# If comparing the same column to itself, skip it
if (j == k):
continue
else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col
if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")
except:
continue
# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")
except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue
print("Correlation dropping completed.")
UPDATE
Using the solution below, I'm running into a few errors and due to my limited dask syntax knowledge, I'm hoping to get some insight. Running Windows 10, Python 3.6 and the latest version of dask.
Using the code as is on MY dataset (the dataset in the link says "file not found"), I ran into the first error:
ValueError: Exactly one of npartitions and chunksize must be specified.
So I specify npartitions=2 in the from_pandas, then get this error:
AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'
I tried changing that to .rechunk('auto'), but then got error:
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
My original dataframe is in the shape of 1275 rows, and 3045 columns. The dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?
I'm not sure if this help but maybe it could be a starting point.
Pandas
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = pd.read_csv(url)
# we check correlation for these columns only
cols = df.columns[-8:]
# columns in this df don't have a big
# correlation coefficient
corr_threshold = 0.5
corr = df[cols].corr().abs().values
# we take the upper triangular only
corr = np.triu(corr)
# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)
# for every row we want only the True columns
cols_to_remove = []
for o in out:
cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))
df = df.drop(cols_to_remove, axis=1)
Dask
Here I comment only the steps are different from pandas
import dask.dataframe as dd
import dask.array as da
url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = dd.read_csv(url)
cols = df.columns[-8:]
corr_threshold = 0.5
corr = df[cols].corr().abs().values
# with dask we need to rechunk
corr = corr.compute_chunk_sizes()
corr = da.triu(corr)
out = (corr != 1) & (corr > corr_threshold)
# dask is lazy
out = out.compute()
cols_to_remove = []
for o in out:
cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))
df = df.drop(cols_to_remove, axis=1)
I'm trying to calculate Welles Wilder's type of moving average in a panda dataframe (also called cumulative moving average).
The method to calculate the Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set as the mean for the 'n' position.
For the following values use the previous mean weighed by (n-1) and the current value of the series weighed by 1 and divide all by 'n'.
My question is: how to implement this in a vectorized way?
I tried to do it iterating over the dataframe (what a I read isn't recommend because is slow). It works, the values are correct, but I get an error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np
#Building Random sample:
datas = pd.date_range('2020-01-01','2020-01-31')
np.random.seed(693)
A = np.random.randint(40,60, size=(31,1))
df = pd.DataFrame(A,index = datas, columns = ['A'])
period = 12 # Main parameter
initial_mean = A[0:period].mean() # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean
for x in range(period, size):
df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period) # Equation for the following values.
print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
pd.Series(
data=[df['A'].iloc[:period].mean()],
index=[df['A'].index[period-1]],
).append(
df['A'].iloc[period:]
)
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.Series(
data=[df['A'].iloc[:period].mean()],
index=[df['A'].index[period-1]],
).append(
df['A'].iloc[period:]
).ewm(
alpha=1.0 / period,
adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
You might want to consider skipping the direct average between the first period items and simply start applying ewm() from the start, which will assume the first row is the previous average in the first calculation. The results will be slightly different but once you've gone through a couple of periods then those initial values will hardly influence the results.
That would be a way more simple calculation:
df['D'] = df['A'].ewm(
alpha=1.0 / period,
adjust=False,
).mean()
Has anyone had issues with rolling standard deviations not working on only one column in a pandas dataframe?
I have a dataframe with a datetime index and associated financial data. When I run df.rolling().std() (psuedo code, see actual below), I get correct data for all columns except one. That column returns 0's where there should be standard deviation values. I also get the same error when using .rolling_std() and I get an error when trying to run df.rolling().skew(), all the other columns work and this column gives NaN.
What's throwing me off about this error is that the other columns work correctly and for this column, df.rolling().mean() works. In addition, the column has dtype float64, which shouldn't be a problem. I also checked and don't see missing data. I'm using a rolling window of 30 days and if I try to get the last standard deviation value using series[-30:].std() I get a correct result. So it seems like something specifically about the rolling portion isn't working. I played around with the parameters of .rolling() but couldn't get anything to change.
# combine the return, volume and slope data
raw_factor_data = pd.concat([fut_rets, vol_factors, slope_factors], axis=1)
# create new dataframe for each factor type (mean,
# std dev, skew) and combine
mean_vals = raw_factor_data.rolling(window=past, min_periods=past).mean()
mean_vals.columns = [column + '_mean' for column in list(mean_vals)]
std_vals = raw_factor_data.rolling(window=past, min_periods=past).std()
std_vals.columns = [column + '_std' for column in list(std_vals)]
skew_vals = raw_factor_data.rolling(window=past, min_periods=past).skew()
skew_vals.columns = [column + '_skew' for column in list(skew_vals)]
fact_data = pd.concat([mean_vals, std_vals, skew_vals], axis=1)
The first line combines three dataframes together. Then I create separate dataframes with rolling mean, std and skew (past = 30), and then combine those into a single dataframe.
The name of the column I'm having trouble with is 'TY1_slope'. So I've run some code as follows to see where there is an error.
print raw_factor_data['TY1_slope'][-30:].std()
print raw_factor_data['TY1_slope'][-30:].mean()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).std()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).mean()
The first two lines of code output a correct standard deviation and mean (.08 and .14). However, the third line of code produces zeroes but the fourth line produces accurate mean values (the final values in those series are 0.0 and .14).
If anyone can help with how to look at the .rolling source code that would be helpful too. I'm new to doing that and tried the following, but just got a few lines that didn't seem very helpful.
import inspect
import pandas as pd
print inspect.getsourcelines(pd.rolling_std)
Quoting JohnE's comment since it worked (although still not sure the root cause of the issue). JohnE, feel free to change to an answer and I'll upvote.
shot in the dark, but you could try rolling(30).apply( lambda x: np.std(x,ddof=1) ) in case it's some weird syntax bug with rolling + std – JohnE
I'm trying to understand how to go around this in python pandas. My objective is to fill column "RESULT" with the initial investment and apply the profit on top of the previous result.
So if I would use an excel spreadsheet I would do this:
Ask what's the initial_investment (in this example $350)
Compute the first row as profit/100*initial_investment + initial_investment
the 2nd and forth will be the same with the exception that "initial_investment" is in the raw above.
my initial python code is this
import pandas as pd
df = pd.DataFrame({"DATE":[2009,2010,2011,2012,2013,2014,2015,2016],"PROFIT":[10,4,5,7,-10,5,-5,3],"RESULT":[350,350,350,350,350,350,350,350]})
print df
You can use the cumulative product function cumprod():
df['RESULT'] = ((df.PROFIT + 100) / 100.).cumprod() * 350
First you transform df.PROFIT into a proportion of the previous value. Then cumprod() multiplies each row by the previous rows. You can then just multiply this by whatever your initial value is.