I can impute the mean and most frequent value using dask-ml like so; this works fine:
import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])
df.iloc[:, [0, 1]] = mean_imputer.fit_transform(df.iloc[:, [0, 1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:, [2]])
print(df)
Weight Age Height
0 100.0 2.0 5.0
1 85.0 4.5 5.0
2 70.0 7.0 5.0
But what if I have 100 million rows of data? It seems that Dask would make two passes when it could have done only one. Is it possible to run both imputers simultaneously and/or in parallel instead of sequentially, and what would sample code to achieve that look like?
You can use dask.delayed, as suggested in the docs and the Dask Tutorial, to parallelise the computation when the pieces of work are independent of one another.
Your code would look like:
from dask.distributed import Client
client = Client(n_workers=4)

from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

def fit_transform_mi(d):
    return mean_imputer.fit_transform(d)

def fit_transform_mfi(d):
    return most_frequent_imputer.fit_transform(d)

def setdf(a, b, df):
    df.iloc[:, [0, 1]] = a
    df.iloc[:, [2]] = b
    return df

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])

a = delayed(fit_transform_mi)(df.iloc[:, [0, 1]])
b = delayed(fit_transform_mfi)(df.iloc[:, [2]])
c = delayed(setdf)(a, b, df)

df = c.compute()
print(df)
client.close()
The c object is a lazy Delayed object. It holds everything needed to compute the final result: references to all of the required functions, their inputs, and their relationships to one another.
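As a quick check (a minimal sketch, not part of the original answer; visualize additionally assumes graphviz is installed), you can render the task graph before computing it, or evaluate the two delayed transforms in a single call:
# Render the task graph: the two fit_transform tasks show up as independent
# branches feeding into setdf (writes impute_graph.svg; needs graphviz).
c.visualize(filename='impute_graph.svg')

# Or compute both delayed transforms together; Dask runs them in parallel.
import dask
a_result, b_result = dask.compute(a, b)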
Dask is useful for speeding up computation through parallel processing and when the data does not fit in memory. In the example below, 300M rows of data contained in ten files are imputed using Dask. The graph of the process shows that: 1. the mean and most-frequent imputers run in parallel; 2. all ten files are also processed in parallel.
Set-up
To prepare a large amount of data, the three rows of data in your question are replicated to form a data frame with 30M rows. The data frame is saved to ten different files, yielding a total of 300M rows with the same stats as in your question.
import numpy as np
import pandas as pd

N = 10000000
weight = np.array([100, np.nan, 70] * N)
age = np.array([2, np.nan, 7] * N)
height = np.array([5, np.nan, 5] * N)
df = pd.DataFrame({'Weight': weight, 'Age': age, 'Height': height})

# Save ten large data frames to disk
for i in range(10):
    df.to_parquet(f'./df_to_impute_{i}.parquet', compression='gzip',
                  index=False)
Dask Imputation
import graphviz
import dask
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

# Read all files for imputation into a dask data frame from a specific directory
df = dd.read_parquet('./df_to_impute_*.parquet')

# Set up the imputers and columns
mean_imputer = SimpleImputer(strategy='mean')
mostfreq_imputer = SimpleImputer(strategy='most_frequent')
imputers = [mean_imputer, mostfreq_imputer]
mean_cols = ['Weight', 'Age']
freq_cols = ['Height']
columns = [mean_cols, freq_cols]

# Create a new data frame with imputed values, then visualize the computation.
df_list = []
for imputer, col in zip(imputers, columns):
    df_list.append(imputer.fit_transform(df.loc[:, col]))
imputed_df = dd.concat(df_list, axis=1)
imputed_df.visualize(filename='imputed.svg', rankdir='LR')

# Save the new data frame to disk
imputed_df.to_parquet('imputed_df.parquet', compression='gzip')
Output
imputed_df.head()
Weight Age Height
0 100.0 2.0 5.0
1 85.0 4.5 5.0
2 70.0 7.0 5.0
3 100.0 2.0 5.0
4 85.0 4.5 5.0
# Check the summary statistics make sense - 300M rows and stats as expected
imputed_df.describe().compute()
Weight Age Height
count 3.000000e+08 3.000000e+08 300000000.0
mean 8.500000e+01 4.500000e+00 5.0
std 1.224745e+01 2.041241e+00 0.0
min 7.000000e+01 2.000000e+00 5.0
25% 7.000000e+01 2.000000e+00 5.0
50% 8.500000e+01 4.500000e+00 5.0
75% 1.000000e+02 7.000000e+00 5.0
max 1.000000e+02 7.000000e+00 5.0
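If you want to see how the work is split across the ten files (a sketch, not from the original answer, assuming the parquet files created above exist), you can check the partition count and attach a local distributed client to watch progress on the dashboard:
from dask.distributed import Client

client = Client(n_workers=4)   # local cluster; dashboard URL is in client.dashboard_link
print(df.npartitions)          # typically one partition per input file (10 here)
client.close()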
Related
Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or anything similar. Of course I can manually attach the relevant column names (see below), but I was wondering whether it might be possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # makes sp.stats available

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
                                            2: 'Mean', 3: 'Var',
                                            4: 'Skewness',
                                            5: 'Kurtosis'})
You can use _fields to get the column names from the named tuple:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
It is also possible to create a dictionary from the named tuple with _asdict:
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
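If you need the same summary for every column (a small sketch building on the data frame above, not part of the original answer), you can collect one _asdict() per column and stack the results:
# One describe() result per column; the index holds the column names.
rows = {col: sp.stats.describe(data[col])._asdict() for col in data.columns}
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)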
I have a CSV with two columns, Dates and Profit/Losses, which I have read into a data frame.
import os
import csv
import pandas as pd

cpath = os.path.join('..', 'Resources', 'budget_data.csv')
df = pd.read_csv(cpath)
df["Profit/Losses"] = df["Profit/Losses"].astype(int)

data = pd.DataFrame(
    [
        ["2019-01-01", 40],
        ["2019-02-01", -5],
        ["2019-03-01", 15],
    ],
    columns=["Dates", "Profit/Losses"]
)
I want to know the difference in profits and losses from month to month (each row is one month), so I thought to use df.diff to calculate the values:
df.diff()
This however results in errors, as I think it is trying to compute the difference of the Dates column as well, and I'm not sure how to make it operate only on the profits and losses.
Is this what you are looking for?
import pandas as pd
data = pd.DataFrame(
    [
        ["2019-01-01", 40],
        ["2019-02-01", -5],
        ["2019-03-01", 15],
    ],
    columns=["Dates", "Profit/Losses"]
)
data.assign(Delta=lambda d: d["Profit/Losses"].diff().fillna(0))
Yields
Dates Profit/Losses Delta
0 2019-01-01 40 0.0
1 2019-02-01 -5 -45.0
2 2019-03-01 15 20.0
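Another option (a minimal sketch on the same data, not from the original answer) is to restrict diff to the numeric columns explicitly, so the Dates column is never touched:
# Difference only the numeric columns; Dates is excluded from the computation.
deltas = data.select_dtypes("number").diff().fillna(0)
data = data.join(deltas.add_suffix("_delta"))
print(data)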
Maybe you can do this:
import pandas as pd
x = [[1,2], [1,2], [1,4]]
d = pd.DataFrame(x, columns=['loss', 'profit'])
d.insert(0, "diff", [d['profit'][i] - d['loss'][i] for i in d.index])
d.head()
Gives:
diff loss profit
0 1 1 2
1 1 1 2
2 3 1 4
I am trying to calculate a moving average over a very large data set with approximately 30M rows. To illustrate, using pandas:
import pandas as pd

df = pd.DataFrame({'cust_id': ['a', 'a', 'a', 'b', 'b'], 'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())
Here I am using pandas to calculate the moving average. With the above it takes around 20 minutes on the 30M-row dataset. Is there a way to leverage Dask here?
You can use dask.delayed for your calculation. In the example below, a standard Python function containing the pandas moving-average command is turned into a lazy Dask function using the @delayed decorator.
import pandas as pd
from dask import delayed

@delayed
def mov_average(x):
    x['mov_avg'] = x.groupby("cust_id")["sales"].apply(
        lambda x: x.ewm(alpha=0.5, adjust=False).mean())
    return x

df = pd.DataFrame({'cust_id': ['a', 'a', 'a', 'b', 'b'],
                   'sales': [100, 200, 300, 400, 500]})

df['mov_avg'] = df.groupby("cust_id")["sales"].apply(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean())

df_1 = mov_average(df).compute()
Output
df
Out[22]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
df_1
Out[23]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
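Note that delaying a single call does not by itself add any parallelism; the benefit appears when several independent pieces of work are submitted together. A rough sketch, assuming a hypothetical list chunks of pandas data frames (for example one per input file, with each customer's rows contained in a single chunk):
import dask

# chunks: hypothetical list of pandas DataFrames, one per file/partition.
delayed_parts = [mov_average(chunk) for chunk in chunks]  # each call stays lazy
results = dask.compute(*delayed_parts)                    # evaluated in parallel
df_all = pd.concat(results, ignore_index=True)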
Alternatively, you could try converting your data frame into (or reading your file directly as) a Dask data frame. The visualization of the scheduler tasks shows how the calculations are parallelized, so if your data frame is large enough you might see a reduction in computation time. You could also try tuning the number of data frame partitions; a sketch follows the output below.
from dask import dataframe

ddf = dataframe.from_pandas(df, npartitions=3)
ddf['emv'] = ddf.groupby('cust_id')['sales'].apply(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean()).compute().sort_index()
ddf.visualize()
ddf.compute()
cust_id sales emv
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
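Two tuning knobs worth trying (a sketch based on the ddf above, not from the original answer): repartitioning the Dask data frame, and passing meta to groupby.apply so Dask does not have to guess the output type of the lambda:
# Repartition so each worker handles fewer, larger chunks (value is illustrative).
ddf = ddf.repartition(npartitions=2)

# meta gives Dask the name/dtype of the result, avoiding the inference warning.
emv = ddf.groupby('cust_id')['sales'].apply(
    lambda x: x.ewm(alpha=0.5, adjust=False).mean(),
    meta=('sales', 'f8'))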
Using a numpy random number generator, generate arrays of the height and weight of the 88,000 people living in Utah.
The average height is 1.75 metres and the average weight is 70 kg. Assume a standard deviation of 3.
Combine these two arrays using the column_stack method and convert the result into a pandas DataFrame with the first column named 'height' and the second column named 'weight'.
I've gotten the randomly generated data. However, I can't seem to convert the array to a DataFrame.
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
print(Utah)
df = pd.DataFrame(
[[np_height],
[np_weight]],
index = [0, 1],
columns = ['height', 'weight'])
print(df)
You want 2 columns, yet you passed the data [[np_height], [np_weight]] as a single column. You can pass the data as a dict instead.
df = pd.DataFrame({'height':np_height,
'weight':np_weight},
columns = ['height', 'weight'])
print(df)
The data in Utah is already in a suitable shape. Why not use that?
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
df = pd.DataFrame(
data=Utah,
columns=['height', 'weight']
)
print(df.head())
height weight
0 3.57 65.32
1 -0.15 66.22
2 5.65 73.11
3 2.00 69.59
4 2.67 64.95
I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])[1]

df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False)  # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
EDIT
Adding a different answer based on the new information the OP added to the question.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
After researching coint, I gather that you are trying to pass two rolling windows into f as x['a'] and x['b']. The following will create the arrays and the data frame.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use apply(f, axis=1) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance will be like for this approach, if it is what you are looking for.
Hats off to @Zero, who really had the solution for this problem here.
I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer but works the same, and on large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
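Timing a four-row frame once is noisy, so to check whether the gap holds up at scale (a sketch with synthetic data, not from the original answer; absolute numbers will vary by machine), you can time both variants on a much larger frame:
import time
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.rand(1_000_000, 2), columns=['a', 'b'])

s = time.time()
big.sum(axis=1).rolling(2).mean()
print('sum then rolling mean:', time.time() - s)

s = time.time()
big.rolling(2).mean().sum(axis=1)
print('rolling mean then sum:', time.time() - s)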