I'm not quite sure how to ask this question, but I need some clarification on how to make use of Dask's ability to "handle datasets that don't fit into memory", because I'm a little confused on how it works from the CREATION of these datasets.
I have put together a reproducible example below that closely emulates my problem. Although this example DOES fit into my 16 GB of memory, we can assume that it doesn't, because it takes up ALMOST all of my RAM.
I'm working with 1min, 5min, 15min and Daily stock market datasets, all of which have their own technical indicators, so each of these separate dataframes is 234 columns wide, with the 1min dataset having the most rows (521,811), and going down from there. Each of these datasets can be created and fit into memory on its own, but here's where it gets tricky.
I'm trying to merge them column-wise into 1 dataframe, with each column prefixed by its timeframe so I can tell them apart, but this creates the memory problem. This is what I'm looking to accomplish visually:
I'm not really sure if Dask is what I need here, but I assume so. I'm NOT looking to use any kind of "parallel calculations" here (yet), I just need a way to create this dataframe before feeding it into a machine learning pipeline (yes, I know it's a stock market problem, just overlook that for now). I know Dask has a machine learning pipeline I can use, so maybe I'll make use of that in the future, however I need a way to save this big dataframe to disk, or create it upon importing it on the fly.
What I need help with is how to do this. Seeing as each of these datasets on their own fit into memory nicely, an idea I had (and this may not be correct at all so please let me know), would be to save each of the dataframes to separate parquet files to disk, then create a Dask dataframe object to import each of them into, when I go to start the machine learning pipeline. Something like this:
Is this conceptually correct with what I need to do, or am I way off? haha. I've read through the documentation on Dask, and also checked out this guide specifically, which is good, however as a newbie I need some guidance with how to do this for the first time.
How can I create and save this big merged dataframe to disk, if I can't create it in memory in the first place?
Here is my reproducible dataframe/memory problem code. Be careful when you run this, as it'll eat up your RAM pretty quickly. I have 16 GB of RAM and it does run on my fairly light machine, but not without red-lining the RAM. I just wanted to give the Dask gods out there something specific to work with. Thanks!
from pandas import DataFrame, date_range, merge
from numpy import random
# ------------------------------------------------------------------------------------------------ #
# 1 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
ONE_MIN_NUM_OF_ROWS = 521811
ONE_MIN_NUM_OF_COLS = 234
main_df = DataFrame(random.randint(0,100, size=(ONE_MIN_NUM_OF_ROWS, ONE_MIN_NUM_OF_COLS)),
columns=list("col_" + str(x) for x in range(ONE_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="min", periods=ONE_MIN_NUM_OF_ROWS))
# ------------------------------------------------------------------------------------------------ #
# 5 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
FIVE_MIN_NUM_OF_ROWS = 117732
FIVE_MIN_NUM_OF_COLS = 234
five_min_df = DataFrame(random.randint(0,100, size=(FIVE_MIN_NUM_OF_ROWS, FIVE_MIN_NUM_OF_COLS)),
columns=list("5_min_col_" + str(x) for x in range(FIVE_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="5min", periods=FIVE_MIN_NUM_OF_ROWS))
# Merge the 5 minute to the 1 minute df
main_df = merge(main_df, five_min_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# 15 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
FIFTEEN_MIN_NUM_OF_ROWS = 117732
FIFTEEN_MIN_NUM_OF_COLS = 234
fifteen_min_df = DataFrame(random.randint(0,100, size=(FIFTEEN_MIN_NUM_OF_ROWS, FIFTEEN_MIN_NUM_OF_COLS)),
columns=list("15_min_col_" + str(x) for x in range(FIFTEEN_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="15min", periods=FIFTEEN_MIN_NUM_OF_ROWS))
# Merge the 15 minute to the main df
main_df = merge(main_df, fifteen_min_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# DAILY DATASET #
# ------------------------------------------------------------------------------------------------ #
DAILY_NUM_OF_ROWS = 933
DAILY_NUM_OF_COLS = 234
daily_df = DataFrame(random.randint(0,100, size=(DAILY_NUM_OF_ROWS, DAILY_NUM_OF_COLS)),
columns=list("daily_col_" + str(x) for x in range(DAILY_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="D", periods=DAILY_NUM_OF_ROWS))
# Merge the daily to the main df (don't worry about "forward peeking" dates)
main_df = merge(main_df, daily_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# FFILL NAN's #
# ------------------------------------------------------------------------------------------------ #
main_df = main_df.ffill()
# ------------------------------------------------------------------------------------------------ #
# INSPECT #
# ------------------------------------------------------------------------------------------------ #
print(main_df)
UPDATE
Thanks to the top answer below, I'm getting closer to my solution.
I've fixed a few syntax errors in the code, and have a working example, UP TO the daily timeframe. When I use the 1B timeframe for upsampling to business days, the error is:
ValueError: <BusinessDay> is a non-fixed frequency
I think it has something to do with this line:
data_index = higher_resolution_index.floor(data_freq).drop_duplicates()
...as that's what I see in the traceback. I don't think Pandas likes the 1B timeframe and the floor() function, so is there an alternative?
I need to have daily data in there too, however the code works for every other timeframe. Once I can get this daily thing figured out, I'll be able to apply it to my use case.
Thanks!
from pandas import DataFrame, concat, date_range
from numpy import random
import dask.dataframe as dd, dask.delayed
ROW_CHUNK_SIZE = 5000
def load_data_subset(start_date, freq, data_freq, hf_periods):
    higher_resolution_index = date_range(start_date, freq=freq, periods=hf_periods)
    data_index = higher_resolution_index.floor(data_freq).drop_duplicates()
    dummy_response = DataFrame(
        random.randint(0, 100, size=(len(data_index), 234)),
        columns=list(
            f"{data_freq}_col_" + str(x) for x in range(234)
        ),
        index=data_index
    )
    dummy_response = dummy_response.loc[higher_resolution_index.floor(data_freq)].set_axis(higher_resolution_index)
    return dummy_response
@dask.delayed
def load_all_columns_for_subset(start_date, freq, hf_periods):
    return concat(
        [
            load_data_subset(start_date, freq, "1min", hf_periods),
            load_data_subset(start_date, freq, "5min", hf_periods),
            load_data_subset(start_date, freq, "15min", hf_periods),
            load_data_subset(start_date, freq, "1H", hf_periods),
            load_data_subset(start_date, freq, "4H", hf_periods),
            load_data_subset(start_date, freq, "1B", hf_periods),
        ],
        axis=1,
    )
ONE_MIN_NUM_OF_ROWS = 521811
full_index = date_range(
start="2019-12-09 04:00:00",
freq="1min",
periods=ONE_MIN_NUM_OF_ROWS,
)
df = dask.dataframe.from_delayed([load_all_columns_for_subset(full_index[i], freq="1min", hf_periods=ROW_CHUNK_SIZE) for i in range(0, ONE_MIN_NUM_OF_ROWS, ROW_CHUNK_SIZE)])
# Save df to parquet here when ready
I would take the dask.dataframe tutorial and look at the dataframe best practices guide. dask can work with larger-than-memory datasets generally by one of two approaches:
1. Design your job ahead of time, then iterate through partitions of the data, writing the outputs as you go, so that not all of the data is in memory at the same time.
2. Use a distributed cluster to leverage more (distributed) memory than exists on any one machine.
It sounds like you're looking for approach (1). The actual implementation will depend on how you access/generate the data, but generally I'd say you should not think of the job as "generate the larger-than-memory dataset in memory then dump it into the dask dataframe". Instead, you'll need to think carefully about how to load the data partition-by-partition, so that each partition can work independently.
Modifying your example, the full workflow might look something like this:
import pandas as pd, numpy as np, dask.dataframe, dask.delayed
@dask.delayed
def load_data_subset(start_date, freq, periods):
    # presumably, you'd query some API or something here
    dummy_ind = pd.date_range(start_date, freq=freq, periods=periods)
    dummy_response = pd.DataFrame(
        np.random.randint(0, 100, size=(len(dummy_ind), 234)),
        columns=list("daily_col_" + str(x) for x in range(234)),
        index=dummy_ind
    )
    return dummy_response
# generate a partitioned dataset with a configurable frequency, with each dataframe having a consistent number of rows.
FIFTEEN_MIN_NUM_OF_ROWS = 117732
full_index = pd.date_range(
start="2019-12-09 04:00:00",
freq="15min",
periods=FIFTEEN_MIN_NUM_OF_ROWS,
)
df_15min = dask.dataframe.from_delayed([
load_data_subset(full_index[i], freq="15min", periods=10000)
for i in range(0, FIFTEEN_MIN_NUM_OF_ROWS, 10000)
])
You could now write these to disk, concat, etc, and at any given point, each dask worker will only be working with 10,000 rows at a time. Ideally, you'll design the chunks so each partition will have a couple hundred MBs each - see the best practices section on partition sizing.
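As a rough way to pick a chunk size, the arithmetic can be sketched as below. This assumes 8-byte values (e.g. int64 or float64) and ignores the index and any pandas overhead. The helper name `rows_per_partition` and the 100 MiB default target are illustrative assumptions, not something Dask provides.

```python
BYTES_PER_VALUE = 8  # assumption: int64/float64 columns


def rows_per_partition(n_cols, target_bytes=100 * 1024**2,
                       bytes_per_value=BYTES_PER_VALUE):
    """Estimate how many rows fit in one partition of roughly target_bytes."""
    bytes_per_row = n_cols * bytes_per_value
    return max(1, target_bytes // bytes_per_row)


# Size of one 10,000-row chunk of a single 234-column frequency
bytes_for_10k_rows = 10_000 * 234 * BYTES_PER_VALUE

print(bytes_for_10k_rows / 1e6, "MB per 10,000-row, 234-column chunk")
print(rows_per_partition(234), "rows per ~100 MiB partition at 234 columns")
print(rows_per_partition(936), "rows per ~100 MiB partition at 936 columns")
```

By this estimate a 10,000-row chunk of one frequency is under 20 MB, so the chunk size could be raised if larger partitions are wanted.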
This could be extended to include multiple frequencies like this:
import pandas as pd, numpy as np, dask.dataframe, dask.delayed
def load_data_subset(start_date, freq, data_freq, hf_periods):
    # here's your 1min time series *for this partition*
    high_res_ind = pd.date_range(start_date, freq=freq, periods=hf_periods)
    # here's your lower frequency (e.g. 1H, 1day) index
    # for the same period
    data_ind = high_res_ind.floor(data_freq).drop_duplicates()
    # presumably, you'd query some API or something here.
    # Alternatively, you could read subsets of your pre-generated
    # frequency files. this covers the same dates as the 1 minute
    # dataset, but only has the number of periods in the lower-res
    # time series
    dummy_response = pd.DataFrame(
        np.random.randint(0, 100, size=(len(data_ind), 234)),
        columns=list(
            f"{data_freq}_col_" + str(x) for x in range(234)
        ),
        index=data_ind
    )
    # now, reindex to the shape of the new data (this does the
    # forward fill step):
    dummy_response = (
        dummy_response
        .loc[high_res_ind.floor(data_freq)]
        .set_axis(high_res_ind)
    )
    return dummy_response
@dask.delayed
def load_all_columns_for_subset(start_date, periods):
    return pd.concat(
        [
            load_data_subset(start_date, "1min", "1min", periods),
            load_data_subset(start_date, "1min", "5min", periods),
            load_data_subset(start_date, "1min", "15min", periods),
            load_data_subset(start_date, "1min", "D", periods),
        ],
        axis=1,
    )
# generate a partitioned dataset with all columns, where lower
# frequency columns have been ffilled, with each dataframe having
# a consistent number of rows.
ONE_MIN_NUM_OF_ROWS = 521811
full_index = pd.date_range(
start="2019-12-09 04:00:00",
freq="1min",
periods=ONE_MIN_NUM_OF_ROWS,
)
df_full = dask.dataframe.from_delayed([
load_all_columns_for_subset(full_index[i], periods=10000)
for i in range(0, ONE_MIN_NUM_OF_ROWS, 10000)
])
This runs straight through for me. It also exports the full dataframe just fine if you call df_full.to_parquet(filepath) right after this. I ran this with a dask.distributed scheduler (running on my laptop) and kept an eye on the dashboard and total memory never exceeded 3.5GB.
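One detail to be aware of: `range(0, ONE_MIN_NUM_OF_ROWS, 10000)` combined with a fixed `periods=10000` creates 53 partitions of 10,000 rows each, i.e. 530,000 rows rather than 521,811, because the last partition runs past the end of `full_index`. If that matters, a small helper can size the final chunk. This is only a sketch, and `chunk_bounds` is a hypothetical name:

```python
ONE_MIN_NUM_OF_ROWS = 521811
ROW_CHUNK_SIZE = 10000


def chunk_bounds(n_rows, chunk_size):
    """Yield (start_position, n_periods) pairs that cover n_rows exactly."""
    for start in range(0, n_rows, chunk_size):
        # the last chunk is shortened so it stops at n_rows
        yield start, min(chunk_size, n_rows - start)


bounds = list(chunk_bounds(ONE_MIN_NUM_OF_ROWS, ROW_CHUNK_SIZE))
print(len(bounds), "partitions,", sum(n for _, n in bounds), "rows in total")
```

Each pair could then be passed along as `load_all_columns_for_subset(full_index[start], periods=n)`.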
Because there are so many columns the dask.dataframe preview is a bit unwieldy, but here's the head and tail:
In [10]: df_full.head()
Out[10]:
1min_col_0 1min_col_1 1min_col_2 1min_col_3 1min_col_4 1min_col_5 1min_col_6 1min_col_7 ... D_col_226 D_col_227 D_col_228 D_col_229 D_col_230 D_col_231 D_col_232 D_col_233
2019-12-09 04:00:00 88 36 34 57 54 98 4 92 ... 84 3 49 29 62 47 21 21
2019-12-09 04:01:00 89 61 50 2 73 44 49 33 ... 84 3 49 29 62 47 21 21
2019-12-09 04:02:00 9 18 73 76 28 17 10 49 ... 84 3 49 29 62 47 21 21
2019-12-09 04:03:00 59 73 92 28 32 8 24 85 ... 84 3 49 29 62 47 21 21
2019-12-09 04:04:00 40 54 23 5 52 63 61 64 ... 84 3 49 29 62 47 21 21
[5 rows x 936 columns]
In [11]: df_full.tail()
Out[11]:
1min_col_0 1min_col_1 1min_col_2 1min_col_3 1min_col_4 1min_col_5 1min_col_6 1min_col_7 ... D_col_226 D_col_227 D_col_228 D_col_229 D_col_230 D_col_231 D_col_232 D_col_233
2020-12-11 05:15:00 81 8 51 2 77 26 66 23 ... 15 51 66 26 88 85 91 65
2020-12-11 05:16:00 67 68 34 58 43 40 76 72 ... 15 51 66 26 88 85 91 65
2020-12-11 05:17:00 93 66 21 39 12 96 53 4 ... 15 51 66 26 88 85 91 65
2020-12-11 05:18:00 69 9 69 41 5 6 6 37 ... 15 51 66 26 88 85 91 65
2020-12-11 05:19:00 18 50 25 74 78 51 10 83 ... 15 51 66 26 88 85 91 65
[5 rows x 936 columns]
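On the `1B` error from the question's update: `floor()` only accepts fixed frequencies, and a business day is not one. One possible workaround, sketched under the assumption that the daily data is indexed by business-day dates at midnight, is to skip `floor()` for that frequency and let `reindex(..., method="ffill")` pick the most recent daily row for every minute, which also carries Friday's values across the weekend. The column name and values below are dummies.

```python
import pandas as pd

# 1-minute index for one partition: Friday 04:00 through Tuesday 03:59
high_res_ind = pd.date_range("2019-12-13 04:00:00", freq="1min",
                             periods=4 * 24 * 60)

# Dummy daily data that exists on business days only (Fri, Mon, Tue)
bdays = pd.bdate_range("2019-12-13", "2019-12-17")
daily = pd.DataFrame({"daily_col_0": range(len(bdays))}, index=bdays)

# For each minute, take the last daily row at or before that timestamp.
# No floor("1B") is needed, so the non-fixed-frequency error is avoided.
aligned = daily.reindex(high_res_ind, method="ffill")

print(aligned.loc[pd.Timestamp("2019-12-14 12:00:00")])  # a Saturday minute
```

For plain calendar days, `floor("D")` (as used in the answer's code) remains fine, since a calendar day is a fixed frequency.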
I'm trying to get groupby stats with additional math operations between the aggregations
I tried
...agg({
'id':"count",
'repair':"count",
('repair':"count")/('id':"count")
})
yr id repair
2016 37 27
2017 53 28
After grouping I'm able to get this stat by
gr['repair']/gr['id']*100
yr
2016 0.73
2017 0.53
How can I get this type of calculation within the groupby?
Consider a custom function that returns an aggregated data set:
def agg_func(g):
    g['id'] = g['id'].count()
    g['repair'] = g['repair'].count()
    g['repair_per_id'] = (g['repair'] / g['id']) * 100
    return g.aggregate('max') # CAN ALSO USE: min, max, mean, median, mode
agg_df = (df.groupby(['group'])
.apply(agg_func)
.reset_index(drop=True)
)
To demonstrate with seeded, random data:
import numpy as np
import pandas as pd
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
np.random.seed(8192019)
random_df = pd.DataFrame({'group': np.random.choice(data_tools, 500),
'id': np.random.randint(1, 10, 500),
'repair': np.random.uniform(0, 100, 500)
})
# RANDOMLY ASSIGN NANs
random_df.loc[np.random.choice(random_df.index, 75), 'repair'] = np.nan
# RUN AGGREGATIONS
agg_df = (random_df.groupby(['group'])
.apply(agg_func)
.reset_index(drop=True)
)
print(agg_df)
# group id repair repair_per_id
# 0 julia 79 70 88.607595
# 1 python 89 74 83.146067
# 2 r 82 69 84.146341
# 3 sas 74 66 89.189189
# 4 spss 77 69 89.610390
# 5 stata 99 84 84.848485
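A different route to the same kind of result, shown here only as a sketch, is to do the two counts with named aggregation and then derive the ratio with `assign`. This is not the custom-function approach above. The small `yr`/`id`/`repair` frame is made-up data, shaped like the question's columns.

```python
import numpy as np
import pandas as pd

# Made-up data: NaN in 'repair' means no repair was recorded for that id
df = pd.DataFrame({
    "yr": [2016, 2016, 2016, 2017, 2017],
    "id": [1, 2, 3, 4, 5],
    "repair": [10.0, np.nan, 30.0, np.nan, 50.0],
})

stats = (
    df.groupby("yr")
      # count() skips NaN, so 'repair' counts only recorded repairs
      .agg(id=("id", "count"), repair=("repair", "count"))
      # the ratio is computed on the already-aggregated columns
      .assign(repair_per_id=lambda d: d["repair"] / d["id"] * 100)
)
print(stats)
```

The ratio has to be computed after the aggregation step because `agg` itself works on one column at a time.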
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A)F(B)F(C)F(D)F(E)F(F)F(G)F(H)F(I)F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, after applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, median and max of the row to some coordinates. This is done each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
help!
My current thinking is to build up the new df row by row? I'm trying to do that but I'm getting a lot of multiindex columns etc. Any pointers would be great.
thanks so much... merry xmas to you all.
Consider calculating the row aggregates with DataFrame methods (axis=1) and then passing those Series to a DataFrame.apply() across columns:
cols = list('ABCDEFGHIJ')
# ROW-WISE AGGREGATES (COMPUTED ON THE DATA COLUMNS ONLY, SO THE
# HELPER COLUMNS ADDED HERE DO NOT LEAK INTO THE LATER AGGREGATES)
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
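To make the pattern concrete, here is a small runnable sketch using the question's sample rows. The question never shows the body of `foo`, so the version below is a purely hypothetical stand-in (a mean-centred, range-scaled value). Any function that works element-wise on Series should slot in the same way.

```python
import pandas as pd


def foo(x, row_min, row_max, row_mean):
    # HYPOTHETICAL stand-in for the asker's formula, for illustration only
    return (x - row_mean) / (row_max - row_min)


df_input = pd.DataFrame(
    [[60, 48, 26, 29, 41, 91, 93, 87, 39, 65],
     [88, 52, 24, 99, 1, 27, 12, 26, 64, 87],
     [13, 1, 38, 60, 8, 50, 59, 1, 3, 76]],
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    columns=list("ABCDEFGHIJ"),
)

# One value per row; each Series shares df_input's index
row_min = df_input.min(axis=1)
row_max = df_input.max(axis=1)
row_mean = df_input.mean(axis=1)

# Each column is a Series aligned on the same index as the aggregates,
# so foo is evaluated for all rows of a column at once
df_output = df_input.apply(lambda col: foo(col, row_min, row_max, row_mean))
print(df_output.round(3))
```

Keeping the aggregates as separate Series, rather than as extra columns, leaves `df_input` untouched and gives `df_output` the same shape and labels.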
I have vehicle information that I want to evaluate over several different time periods, and I'm modifying different columns in the DataFrame as I move through the information. I'm working with the current and previous time periods, so I need to concat the two and work on them together.
The problem I'm having is that when I use the 'time' column as an index in pandas and loop through the data, the object that is returned is either a DataFrame or a Series, depending on the number of vehicles (or rows) in the time period. This change in object type creates an error, as I'm trying to use DataFrame methods on Series objects.
I created a small sample program that shows what I'm trying to do and the error that I'm receiving. Note this is a sample and not the real code. I have tried simply querying the data by time period instead of using an index, and that works, but it is too slow for what I need to do.
import pandas as pd
df = pd.DataFrame({
'id' : range(44, 51),
'time' : [99,99,97,97,96,96,100],
'spd' : [13,22,32,41,42,53,34],
})
df = df.set_index(['time'], drop = False)
st = True
for ind in df.index.unique():
    data = df.ix[ind]
    print data
    if st:
        old_data = data
        st = False
    else:
        c = pd.concat([data, old_data])
        #do some work here
OUTPUT IS:
id spd time
time
99 44 13 99
99 45 22 99
id spd time
time
97 46 32 97
97 47 41 97
id spd time
time
96 48 42 96
96 49 53 96
id 50
spd 34
time 100
Name: 100, dtype: int64
Traceback (most recent call last):
File "C:/Users/m28050/Documents/Projects/fhwa/tca/v_2/code/pandas_ind.py", line 24, in <module>
c = pd.concat([data, old_data])
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 873, in concat
return op.get_result()
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 946, in get_result
new_data = com._concat_compat([x.values for x in self.objs])
File "C:\Python27\lib\site-packages\pandas\core\common.py", line 1737, in _concat_compat
return np.concatenate(to_concat, axis=axis)
ValueError: all the input arrays must have same number of dimensions
If anyone has the correct way to loop through the DataFrame and update the columns or can point out a different method to use, that would be great.
Thanks for your help.
Jim
I think groupby could help here:
In [11]: spd_lt_40 = df1[df1.spd < 40]
In [12]: spd_lt_40_count = spd_lt_40.groupby('time')['id'].count()
In [13]: spd_lt_40_count
Out[13]:
time
97 1
99 2
100 1
dtype: int64
and then set this to a column in the original DataFrame:
In [14]: df1['spd_lt_40_count'] = spd_lt_40_count
In [15]: df1['spd_lt_40_count'] = df1['spd_lt_40_count'].fillna(0)
In [16]: df1
Out[16]:
id spd time spd_lt_40_count
time
99 44 13 99 2
99 45 22 99 2
97 46 32 97 1
97 47 41 97 1
96 48 42 96 0
96 49 53 96 0
100 50 34 100 1
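If the loop over time periods is still needed, the Series-versus-DataFrame switch itself can be avoided by selecting with a list of labels, which returns a DataFrame even when only one row matches. The sketch below uses the question's sample data with `.loc` in current pandas. Updating `old_data` on every pass is an assumption about what "current and previous time periods" means, since the original loop only sets it once.

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(44, 51),
    'time': [99, 99, 97, 97, 96, 96, 100],
    'spd': [13, 22, 32, 41, 42, 53, 34],
}).set_index('time', drop=False)

old_data = None
combined = []  # one concatenated frame per (current, previous) pair
for ind in df.index.unique():
    # a list selector ([ind]) always yields a DataFrame, even for one row
    data = df.loc[[ind]]
    if old_data is not None:
        c = pd.concat([data, old_data])
        combined.append(c)  # do some work on c here
    old_data = data  # assumption: compare each period with the one before it

print(combined[-1])
```

The last printed frame holds the single row for time 100 together with the two rows for time 96.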