I'm not quite sure how to ask this question, but I need some clarification on how to make use of Dask's ability to "handle datasets that don't fit into memory", because I'm a little confused on how it works from the CREATION of these datasets.
I have made a reproducible code below that closely emulates my problem. Although this example DOES fit into my 16Gb memory, we can assume that it doesn't because it does take up ALMOST all of my RAM.
I'm working with 1min, 5min, 15min and daily stock market datasets, each with its own technical indicators, so each of these separate dataframes is 234 columns wide. The 1min dataset has the most rows (521,811), and the row counts go down from there. Each of these datasets can be created and fit into memory on its own, but here's where it gets tricky.
I'm trying to merge them column-wise into 1 dataframe, each column prepended with their respective timeframes so I can tell them apart, but this creates the memory problem. This is what I'm looking to accomplish visually:
I'm not really sure if Dask is what I need here, but I assume so. I'm NOT looking to use any kind of "parallel calculations" here (yet), I just need a way to create this dataframe before feeding it into a machine learning pipeline (yes, I know it's a stock market problem, just overlook that for now). I know Dask has a machine learning pipeline I can use, so maybe I'll make use of that in the future, however I need a way to save this big dataframe to disk, or create it upon importing it on the fly.
What I need help with is how to do this. Seeing as each of these datasets on their own fit into memory nicely, an idea I had (and this may not be correct at all so please let me know), would be to save each of the dataframes to separate parquet files to disk, then create a Dask dataframe object to import each of them into, when I go to start the machine learning pipeline. Something like this:
Is this conceptually correct with what I need to do, or am I way off? haha. I've read through the documentation on Dask, and also checked out this guide specifically, which is good, however as a newbie I need some guidance with how to do this for the first time.
How can I create and save this big merged dataframe to disk, if I can't create it in memory in the first place?
Here is my reproducible dataframe/memory problem code. Be careful when you go to run this as it'll eat up your RAM pretty quickly, I have 16Gb of RAM and it does run on my fairly light machine, but not without some red-lining RAM, just wanted to give the Dask gods out there something specific to work with. Thanks!
from pandas import DataFrame, date_range, merge
from numpy import random
# ------------------------------------------------------------------------------------------------ #
# 1 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
ONE_MIN_NUM_OF_ROWS = 521811
ONE_MIN_NUM_OF_COLS = 234
main_df = DataFrame(random.randint(0,100, size=(ONE_MIN_NUM_OF_ROWS, ONE_MIN_NUM_OF_COLS)),
columns=list("col_" + str(x) for x in range(ONE_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="min", periods=ONE_MIN_NUM_OF_ROWS))
# ------------------------------------------------------------------------------------------------ #
# 5 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
FIVE_MIN_NUM_OF_ROWS = 117732
FIVE_MIN_NUM_OF_COLS = 234
five_min_df = DataFrame(random.randint(0,100, size=(FIVE_MIN_NUM_OF_ROWS, FIVE_MIN_NUM_OF_COLS)),
columns=list("5_min_col_" + str(x) for x in range(FIVE_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="5min", periods=FIVE_MIN_NUM_OF_ROWS))
# Merge the 5 minute to the 1 minute df
main_df = merge(main_df, five_min_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# 15 MINUTE DATASET #
# ------------------------------------------------------------------------------------------------ #
FIFTEEN_MIN_NUM_OF_ROWS = 117732
FIFTEEN_MIN_NUM_OF_COLS = 234
fifteen_min_df = DataFrame(random.randint(0,100, size=(FIFTEEN_MIN_NUM_OF_ROWS, FIFTEEN_MIN_NUM_OF_COLS)),
columns=list("15_min_col_" + str(x) for x in range(FIFTEEN_MIN_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="15min", periods=FIFTEEN_MIN_NUM_OF_ROWS))
# Merge the 15 minute to the main df
main_df = merge(main_df, fifteen_min_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# DAILY DATASET #
# ------------------------------------------------------------------------------------------------ #
DAILY_NUM_OF_ROWS = 933
DAILY_NUM_OF_COLS = 234
daily_df = DataFrame(random.randint(0,100, size=(DAILY_NUM_OF_ROWS, DAILY_NUM_OF_COLS)),
columns=list("daily_col_" + str(x) for x in range(DAILY_NUM_OF_COLS)),
index=date_range(start="2019-12-09 04:00:00", freq="D", periods=DAILY_NUM_OF_ROWS))
# Merge the daily to the main df (don't worry about "forward peeking" dates)
main_df = merge(main_df, daily_df, how="outer", left_index=True, right_index=True, sort=True)
# ------------------------------------------------------------------------------------------------ #
# FFILL NAN's #
# ------------------------------------------------------------------------------------------------ #
main_df = main_df.ffill()
# ------------------------------------------------------------------------------------------------ #
# INSPECT #
# ------------------------------------------------------------------------------------------------ #
print(main_df)
UPDATE
Thanks to the top answer below, I'm getting closer to my solution.
I've fixed a few syntax errors in the code, and have a working example, UP TO the daily timeframe. When I use the 1B timeframe for upsampling to business days, the error is:
ValueError: <BusinessDay> is a non-fixed frequency
I think it has something to do with this line:
data_index = higher_resolution_index.floor(data_freq).drop_duplicates()
...as that's what I see in the traceback. I don't think Pandas likes the 1B timeframe and the floor() function, so is there an alternative?
I need to have daily data in there too, however the code works for every other timeframe. Once I can get this daily thing figured out, I'll be able to apply it to my use case.
Thanks!
from pandas import DataFrame, concat, date_range
from numpy import random
import dask.dataframe as dd, dask.delayed
ROW_CHUNK_SIZE = 5000
def load_data_subset(start_date, freq, data_freq, hf_periods):
    higher_resolution_index = date_range(start_date, freq=freq, periods=hf_periods)
    data_index = higher_resolution_index.floor(data_freq).drop_duplicates()
    dummy_response = DataFrame(
        random.randint(0, 100, size=(len(data_index), 234)),
        columns=list(
            f"{data_freq}_col_" + str(x) for x in range(234)
        ),
        index=data_index
    )
    dummy_response = dummy_response.loc[higher_resolution_index.floor(data_freq)].set_axis(higher_resolution_index)
    return dummy_response
@dask.delayed
def load_all_columns_for_subset(start_date, freq, hf_periods):
    return concat(
        [
            load_data_subset(start_date, freq, "1min", hf_periods),
            load_data_subset(start_date, freq, "5min", hf_periods),
            load_data_subset(start_date, freq, "15min", hf_periods),
            load_data_subset(start_date, freq, "1H", hf_periods),
            load_data_subset(start_date, freq, "4H", hf_periods),
            load_data_subset(start_date, freq, "1B", hf_periods),
        ],
        axis=1,
    )
ONE_MIN_NUM_OF_ROWS = 521811
full_index = date_range(
start="2019-12-09 04:00:00",
freq="1min",
periods=ONE_MIN_NUM_OF_ROWS,
)
df = dask.dataframe.from_delayed([load_all_columns_for_subset(full_index[i], freq="1min", hf_periods=ROW_CHUNK_SIZE) for i in range(0, ONE_MIN_NUM_OF_ROWS, ROW_CHUNK_SIZE)])
# Save df to parquet here when ready
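On the ValueError for "1B" above: floor() only accepts fixed frequencies, and a business day is not one. One possible workaround, sketched here under the assumption that every timestamp should map to the most recent business day at or before it (so weekend rows fall back to Friday), is to look the day up in a business-day range instead of flooring. The helper name floor_to_business_day is made up for this illustration and is not part of pandas.

```python
import pandas as pd


def floor_to_business_day(index):
    """Map each timestamp to the most recent business day at or before it."""
    days = index.normalize()  # drop the time of day
    # Candidate business days; the 7-day pad guarantees that at least one
    # business day exists at or before the earliest timestamp.
    bdays = pd.bdate_range(days.min() - pd.Timedelta(days=7), days.max())
    # Position of the last business day <= each calendar day.
    pos = bdays.searchsorted(days, side="right") - 1
    return bdays[pos]


idx = pd.DatetimeIndex([
    "2019-12-09 04:00",  # Monday
    "2019-12-14 10:00",  # Saturday
    "2019-12-15 23:59",  # Sunday
    "2019-12-16 00:00",  # Monday
])
floored = floor_to_business_day(idx)
print(floored)
```

In load_data_subset, the two higher_resolution_index.floor(data_freq) calls could be swapped for this helper whenever data_freq is a business-day frequency.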
I would take the dask.dataframe tutorial and look at the dataframe best practices guide. dask can work with larger-than-memory datasets generally by one of two approaches:
1. Design your job ahead of time, then iterate through partitions of the data, writing the outputs as you go, so that not all of the data is in memory at the same time.
2. Use a distributed cluster to leverage more (distributed) memory than exists on any one machine.
It sounds like you're looking for approach (1). The actual implementation will depend on how you access/generate the data, but generally I'd say you should not think of the job as "generate the larger-than-memory dataset in memory then dump it into the dask dataframe". Instead, you'll need to think carefully about how to load the data partition-by-partition, so that each partition can work independently.
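To make the partition-by-partition idea concrete without any dask machinery, here is a minimal plain-pandas sketch. It builds one small chunk at a time and writes it out before building the next, so only a single chunk is ever held in memory. The tiny sizes, the CSV output and the build_partition helper are assumptions chosen to keep the example quick to run; they are not taken from the question.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def build_partition(start, periods, n_cols=4):
    """Stand-in for loading one chunk of real data (hypothetical helper)."""
    index = pd.date_range(start, freq="1min", periods=periods)
    data = np.random.randint(0, 100, size=(periods, n_cols))
    columns = [f"col_{i}" for i in range(n_cols)]
    return pd.DataFrame(data, index=index, columns=columns)


TOTAL_ROWS = 25   # deliberately tiny; the real dataset has 521,811 rows
CHUNK_ROWS = 10
full_index = pd.date_range("2019-12-09 04:00:00", freq="1min", periods=TOTAL_ROWS)
out_dir = Path(tempfile.mkdtemp())

written = []
for i in range(0, TOTAL_ROWS, CHUNK_ROWS):
    # Shorten the final chunk so the total row count matches TOTAL_ROWS.
    periods = min(CHUNK_ROWS, TOTAL_ROWS - i)
    part = build_partition(full_index[i], periods)
    path = out_dir / f"part_{i // CHUNK_ROWS:04d}.csv"
    part.to_csv(path)              # write this chunk to disk...
    written.append((path, len(part)))
    del part                       # ...and release it before the next one

print(written)
```

dask.delayed plus from_delayed automates the same loop: each delayed call plays the role of build_partition, and dask decides when to run it and when to release the result.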
Modifying your example, the full workflow might look something like this:
import pandas as pd, numpy as np, dask.dataframe, dask.delayed
@dask.delayed
def load_data_subset(start_date, freq, periods):
    # presumably, you'd query some API or something here
    dummy_ind = pd.date_range(start_date, freq=freq, periods=periods)
    dummy_response = pd.DataFrame(
        np.random.randint(0, 100, size=(len(dummy_ind), 234)),
        columns=list("daily_col_" + str(x) for x in range(234)),
        index=dummy_ind
    )
    return dummy_response
# generate a partitioned dataset with a configurable frequency, with each dataframe having a consistent number of rows.
FIFTEEN_MIN_NUM_OF_ROWS = 117732
full_index = pd.date_range(
start="2019-12-09 04:00:00",
freq="15min",
periods=FIFTEEN_MIN_NUM_OF_ROWS,
)
df_15min = dask.dataframe.from_delayed([
load_data_subset(full_index[i], freq="15min", periods=10000)
for i in range(0, FIFTEEN_MIN_NUM_OF_ROWS, 10000)
])
You could now write these to disk, concat, etc, and at any given point, each dask worker will only be working with 10,000 rows at a time. Ideally, you'll design the chunks so each partition will have a couple hundred MBs each - see the best practices section on partition sizing.
This could be extended to include multiple frequencies like this:
import pandas as pd, numpy as np, dask.dataframe, dask.delayed
def load_data_subset(start_date, freq, data_freq, hf_periods):
    # here's your 1min time series *for this partition*
    high_res_ind = pd.date_range(start_date, freq=freq, periods=hf_periods)
    # here's your lower frequency (e.g. 1H, 1day) index
    # for the same period
    data_ind = high_res_ind.floor(data_freq).drop_duplicates()
    # presumably, you'd query some API or something here.
    # Alternatively, you could read subsets of your pre-generated
    # frequency files. this covers the same dates as the 1 minute
    # dataset, but only has the number of periods in the lower-res
    # time series
    dummy_response = pd.DataFrame(
        np.random.randint(0, 100, size=(len(data_ind), 234)),
        columns=list(
            f"{data_freq}_col_" + str(x) for x in range(234)
        ),
        index=data_ind
    )
    # now, reindex to the shape of the new data (this does the
    # forward fill step):
    dummy_response = (
        dummy_response
        .loc[high_res_ind.floor(data_freq)]
        .set_axis(high_res_ind)
    )
    return dummy_response
@dask.delayed
def load_all_columns_for_subset(start_date, periods):
    return pd.concat(
        [
            load_data_subset(start_date, "1min", "1min", periods),
            load_data_subset(start_date, "1min", "5min", periods),
            load_data_subset(start_date, "1min", "15min", periods),
            load_data_subset(start_date, "1min", "D", periods),
        ],
        axis=1,
    )
# generate a partitioned dataset with all columns, where lower
# frequency columns have been ffilled, with each dataframe having
# a consistent number of rows.
ONE_MIN_NUM_OF_ROWS = 521811
full_index = pd.date_range(
start="2019-12-09 04:00:00",
freq="1min",
periods=ONE_MIN_NUM_OF_ROWS,
)
df_full = dask.dataframe.from_delayed([
load_all_columns_for_subset(full_index[i], periods=10000)
for i in range(0, ONE_MIN_NUM_OF_ROWS, 10000)
])
This runs straight through for me. It also exports the full dataframe just fine if you call df_full.to_parquet(filepath) right after this. I ran this with a dask.distributed scheduler (running on my laptop) and kept an eye on the dashboard and total memory never exceeded 3.5GB.
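For a rough feel of how the 10,000-row chunks relate to the partition-size guidance mentioned earlier, you can estimate the bytes per row from a small sample and work backwards. This is only a back-of-the-envelope sketch: it assumes int64 values throughout, and both the 100 MB target and the rows_per_partition helper are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd


def rows_per_partition(bytes_per_row, target_mb=100):
    """Rows that fit in roughly target_mb of memory (the target is an assumption)."""
    return int(target_mb * 1024**2 // bytes_per_row)


# Small sample with the same width as the merged frame: 4 x 234 = 936 columns.
sample = pd.DataFrame(np.zeros((1000, 936), dtype="int64"))
bytes_per_row = int(sample.memory_usage(index=False).sum() // len(sample))

chunk_rows = rows_per_partition(bytes_per_row)
mb_for_10k_rows = 10_000 * bytes_per_row / 1024**2

print(f"{bytes_per_row} bytes per row")
print(f"{chunk_rows} rows fit in about 100 MB")
print(f"10,000 rows take about {mb_for_10k_rows:.1f} MB")
```

With real data, measuring one loaded partition with memory_usage(deep=True) would give a better number than a synthetic sample, especially if any columns hold strings.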
Because there are so many columns the dask.dataframe preview is a bit unwieldy, but here's the head and tail:
In [10]: df_full.head()
Out[10]:
1min_col_0 1min_col_1 1min_col_2 1min_col_3 1min_col_4 1min_col_5 1min_col_6 1min_col_7 ... D_col_226 D_col_227 D_col_228 D_col_229 D_col_230 D_col_231 D_col_232 D_col_233
2019-12-09 04:00:00 88 36 34 57 54 98 4 92 ... 84 3 49 29 62 47 21 21
2019-12-09 04:01:00 89 61 50 2 73 44 49 33 ... 84 3 49 29 62 47 21 21
2019-12-09 04:02:00 9 18 73 76 28 17 10 49 ... 84 3 49 29 62 47 21 21
2019-12-09 04:03:00 59 73 92 28 32 8 24 85 ... 84 3 49 29 62 47 21 21
2019-12-09 04:04:00 40 54 23 5 52 63 61 64 ... 84 3 49 29 62 47 21 21
[5 rows x 936 columns]
In [11]: df_full.tail()
Out[11]:
1min_col_0 1min_col_1 1min_col_2 1min_col_3 1min_col_4 1min_col_5 1min_col_6 1min_col_7 ... D_col_226 D_col_227 D_col_228 D_col_229 D_col_230 D_col_231 D_col_232 D_col_233
2020-12-11 05:15:00 81 8 51 2 77 26 66 23 ... 15 51 66 26 88 85 91 65
2020-12-11 05:16:00 67 68 34 58 43 40 76 72 ... 15 51 66 26 88 85 91 65
2020-12-11 05:17:00 93 66 21 39 12 96 53 4 ... 15 51 66 26 88 85 91 65
2020-12-11 05:18:00 69 9 69 41 5 6 6 37 ... 15 51 66 26 88 85 91 65
2020-12-11 05:19:00 18 50 25 74 78 51 10 83 ... 15 51 66 26 88 85 91 65
[5 rows x 936 columns]
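If it isn't obvious why the floor / .loc / set_axis step in load_data_subset acts as a forward fill, a tiny version with made-up values may help. Each 1-minute timestamp is floored to its 5-minute bucket, the low-frequency row for that bucket is looked up (repeating it as needed), and the result is relabelled with the 1-minute index. The column name and the values 10 and 20 are invented for this sketch.

```python
import pandas as pd

# Seven 1-minute timestamps: 04:00 through 04:06.
high_res = pd.date_range("2019-12-09 04:00:00", freq="1min", periods=7)

# The 5-minute buckets they fall into: 04:00 and 04:05.
low_res = high_res.floor("5min").drop_duplicates()
low = pd.DataFrame({"5min_col_0": [10, 20]}, index=low_res)

# Look up each minute's bucket (rows repeat), then relabel with the minutes.
aligned = low.loc[high_res.floor("5min")].set_axis(high_res)
print(aligned)
```

The first five minutes carry the 04:00 value and the last two carry the 04:05 value, which matches what merging and then forward filling would produce for this data.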
When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to use columns
Ex:
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
('Visitors', [43,34,65,56,29,76]),
('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient when you have many keys, you can use
df = pd.DataFrame(web_stats, columns = web_stats.keys())
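For anyone reading this on a newer setup: on Python 3.7+, where a plain dict keeps insertion order, a reasonably recent pandas keeps the columns in the order the keys were written, so no workaround should be needed. A quick check, which also shows how to pick a different order after the fact by indexing with a list:

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}

# Columns follow the dict's insertion order.
df = pd.DataFrame(web_stats)
print(list(df.columns))

# Any other order can be selected explicitly with a list of column names.
reordered = df[['Bounce Rate', 'Day', 'Visitors']]
print(list(reordered.columns))
```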
I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'] // 100
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
values='value',
index=['year', 'month'],
columns='item',
aggfunc=np.sum)
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
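To check that the three approaches really give the same table, here is a small self-contained comparison. It rebuilds data shaped like the question's (12 months, two items) using a seeded generator, so the actual numbers differ from the outputs shown above; selecting the 'value' column before unstacking is a small tweak that keeps the result's columns flat.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
month = np.arange(1, 13)
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(month, 2),
    "item": np.repeat(["item 1", "item 2"], 12),
    "value": np.concatenate([rng.integers(0, 100, 12),
                             rng.integers(200, 300, 12)]),
})

# 1. group, aggregate, then move 'item' into the columns
by_groupby = df.groupby(["year", "month", "item"])["value"].sum().unstack("item")

# 2. pivot_table with a two-column index
by_pivot_table = df.pivot_table(values="value", index=["year", "month"],
                                columns="item", aggfunc="sum")

# 3. put everything in the index, then unstack 'item'
by_unstack = df.set_index(["year", "month", "item"])["value"].unstack("item")

print(by_groupby.head())
```

Note that approach 3 only works when each (year, month, item) combination appears once; with duplicates, unstack raises an error, while the other two aggregate them.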
The following worked for me (passing a list as the index needs pandas 1.1 or newer):
mypiv = df.pivot(index=['year', 'month'], columns='item', values='value')
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple, flat column structure with each column in its intended type, add this:
(df
.infer_objects() # coerce to the intended column type
.rename_axis(None, axis=1)) # flatten column headers
I have written some code to compute a weighted average using pivot tables in pandas. However, I am not sure how to add the actual column which performs the weighted averaging (Add a new column where each row contains value of 'cumulative'/'COUNT').
The data looks like so:
VALUE COUNT GRID agb
1 43 1476 1051
2 212 1476 2983
5 7 1477 890
4 1361 1477 2310
Here is my code:
# Read input data
lup_df = pandas.DataFrame.from_csv(o_dir+LUP+'.csv',index_col=False)
# Insert a new column with area * variable
lup_df['cumulative'] = lup_df['COUNT']*lup_df['agb']
# Create and output pivot table
lup_pvt = pandas.pivot_table(lup_df, 'agb', rows=['GRID'])
# TODO: Add a new column where each row contains value of 'cumulative'/'COUNT'
lup_pvt.to_csv(o_dir+PIVOT+'.csv',index=True,header=True,sep=',')
How can I do this?
So you want, for each value of grid, the weighted average of the agb column where the weights are the values in the count column. If that interpretation is correct, I think this does the trick with groupby:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
'grid': np.random.choice(np.arange(10)+50, n),
'value': np.random.randn(n) + 12})
df['prod'] = df['count'] * df['value']
grouped = df.groupby('grid').sum()
grouped['wtdavg'] = grouped['prod'] / grouped['count']
print(grouped)
count value prod wtdavg
grid
50 22 57.177042 243.814417 11.082474
51 27 58.801386 318.644085 11.801633
52 11 34.202619 135.127942 12.284358
53 24 59.340084 272.836636 11.368193
54 39 137.268317 482.954857 12.383458
55 47 79.468986 531.122652 11.300482
56 17 38.624369 214.188938 12.599349
57 22 38.572429 279.948202 12.724918
58 27 36.492929 327.315518 12.122797
59 34 60.851671 408.306429 12.009013
Or, if you want to be a bit slick and write a weighted average function you can use over and over:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 50
df = pd.DataFrame({'count': np.random.choice(np.arange(10)+1, n),
'grid': np.random.choice(np.arange(10)+50, n),
'value': np.random.randn(n) + 12})
def wavg(val_col_name, wt_col_name):
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wtd_avg'
    return inner
slick = df.groupby('grid').apply(wavg('value', 'count'))
print(slick)
grid
50 11.082474
51 11.801633
52 12.284358
53 11.368193
54 12.383458
55 11.300482
56 12.599349
57 12.724918
58 12.122797
59 12.009013
dtype: float64
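Applied to the four sample rows from the question, the first approach would look roughly like this. The column names come from the question; wtd_agb is just a name picked for this sketch, and reading the CSV is replaced by an inline frame so it runs on its own.

```python
import pandas as pd

# The four sample rows shown in the question.
lup_df = pd.DataFrame({
    "VALUE": [1, 2, 5, 4],
    "COUNT": [43, 212, 7, 1361],
    "GRID": [1476, 1476, 1477, 1477],
    "agb": [1051, 2983, 890, 2310],
})

# Weight each agb value by its COUNT, then sum both per GRID cell.
lup_df["cumulative"] = lup_df["COUNT"] * lup_df["agb"]
sums = lup_df.groupby("GRID")[["cumulative", "COUNT"]].sum()

# Weighted average = sum(COUNT * agb) / sum(COUNT)
sums["wtd_agb"] = sums["cumulative"] / sums["COUNT"]
print(sums)
```

The resulting frame is indexed by GRID, so it can be written out with to_csv in the same way as the pivot table in the question.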
I have vehicle information that I want to evaluate over several different time periods and I'm modifying different columns in the DataFrame as I move through the information. I'm working with the current and previous time periods so I need to concat the two and work on them together.
The problem I'm having is that when I use the 'time' column as an index in pandas and loop through the data, the object that is returned is either a DataFrame or a Series depending on the number of vehicles (or rows) in the time period. This change in object type creates an error, as I'm trying to use DataFrame methods on Series objects.
I created a small sample program that shows what I'm trying to do and the error that I'm receiving. Note this is a sample and not the real code. I have tried simply querying the data by time period instead of using an index, and that works, but it is too slow for what I need to do.
import pandas as pd
df = pd.DataFrame({
'id' : range(44, 51),
'time' : [99,99,97,97,96,96,100],
'spd' : [13,22,32,41,42,53,34],
})
df = df.set_index(['time'], drop = False)
st = True
for ind in df.index.unique():
    data = df.ix[ind]
    print data
    if st:
        old_data = data
        st = False
    else:
        c = pd.concat([data, old_data])
        #do some work here
OUTPUT IS:
id spd time
time
99 44 13 99
99 45 22 99
id spd time
time
97 46 32 97
97 47 41 97
id spd time
time
96 48 42 96
96 49 53 96
id 50
spd 34
time 100
Name: 100, dtype: int64
Traceback (most recent call last):
File "C:/Users/m28050/Documents/Projects/fhwa/tca/v_2/code/pandas_ind.py", line 24, in <module>
c = pd.concat([data, old_data])
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 873, in concat
return op.get_result()
File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 946, in get_result
new_data = com._concat_compat([x.values for x in self.objs])
File "C:\Python27\lib\site-packages\pandas\core\common.py", line 1737, in _concat_compat
return np.concatenate(to_concat, axis=axis)
ValueError: all the input arrays must have same number of dimensions
If anyone has the correct way to loop through the DataFrame and update the columns or can point out a different method to use, that would be great.
Thanks for your help.
Jim
I think groupby could help here:
In [11]: spd_lt_40 = df1[df1.spd < 40]
In [12]: spd_lt_40_count = spd_lt_40.groupby('time')['id'].count()
In [13]: spd_lt_40_count
Out[13]:
time
97 1
99 2
100 1
dtype: int64
and then set this to a column in the original DataFrame:
In [14]: df1['spd_lt_40_count'] = spd_lt_40_count
In [15]: df1['spd_lt_40_count'].fillna(0, inplace=True)
In [16]: df1
Out[16]:
id spd time spd_lt_40_count
time
99 44 13 99 2
99 45 22 99 2
97 46 32 97 1
97 47 41 97 1
96 48 42 96 0
96 49 53 96 0
100 50 34 100 1
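If you do still need the loop itself, one way to sidestep the DataFrame-versus-Series switch is to index with a list of labels, since df.loc[[label]] returns a DataFrame even when only one row matches. A minimal sketch on the question's sample data, assuming a current pandas (where .ix no longer exists) and assuming that "previous" means the period visited just before the current one:

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(44, 51),
    'time': [99, 99, 97, 97, 96, 96, 100],
    'spd': [13, 22, 32, 41, 42, 53, 34],
}).set_index('time', drop=False)

combined_shapes = []
previous = None
for period in df.index.unique():
    # A list indexer always gives back a DataFrame, even for a single row.
    current = df.loc[[period]]
    if previous is not None:
        both = pd.concat([current, previous])
        # do some work on the two periods together here
        combined_shapes.append(both.shape)
    previous = current

print(combined_shapes)
```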