I have a fairly large pandas dataframe df. I also have a pandas series of scale factors factors.
I want to scale df for every scale factor in factors and concatenate these dataframes together into a larger dataframe. Since this large dataframe will not fit in memory, I thought a dask dataframe might work well here, but I don't know how to get around this problem.
Below is what I want to achieve using pandas dataframes; in the actual case, dflarge will not fit in memory.
import random
import pandas as pd

df = pd.DataFrame({
    'id1': range(1, 6),
    'a': [random.random() for i in range(5)],
    'b': [random.random() for i in range(5)],
})
df = df.set_index('id1')

factors = [random.random() for i in range(10)]

dflist = []
for i, factor in enumerate(factors):
    scaled = df * factor
    scaled['id2'] = i
    dflist.append(scaled)

dflarge = pd.concat(dflist)
dflarge = dflarge.reset_index().set_index(['id1', 'id2'])
I would like to make the scaling and concatenating as efficient as possible since there will be tens of thousands of scale factors. I'd like to run it distributed if possible.
I really appreciate any kind of help you can give.
Just delay it!
Dask.dataframe and dask.delayed are what you need here, and running it using dask.distributed should work fine. Assuming that df is still a pandas.DataFrame, turn the loop into a function that you can call in a list comprehension using dask.delayed. I've made some small changes to your code below:
import random
import pandas as pd
import dask.dataframe as dd
from dask import delayed

df = pd.DataFrame({
    'id1': range(1, 6),
    'a': [random.random() for i in range(5)],
    'b': [random.random() for i in range(5)],
})
df = df.set_index('id1')

factors = [random.random() for i in range(10)]

def scale_my_df(df_init, scale_factor, id_num):
    '''
    Scales and returns a DataFrame.
    '''
    df_scaled = df_init * scale_factor
    df_scaled['id2'] = id_num
    return df_scaled

dfs_delayed = [delayed(scale_my_df)(df_init=df, scale_factor=factor, id_num=i)
               for i, factor in enumerate(factors)]
ddf = dd.from_delayed(dfs_delayed)
And now you have a dask.DataFrame built from your scaled pandas.DataFrames. Two things of note:
Dask is lazy, so as of the end of this code snippet nothing has been computed. A computational graph has been set up with the operations required to create the DataFrame you want. In this example with small DataFrames, you could execute:
ddf_large = ddf.compute()
And you will have the same pandas.DataFrame as dflarge in your code above, assuming the factors are the same. Almost...
As of this writing dask does not appear to support multi-level indices, so your .set_index(['id1', 'id2']) code will not work. This has been raised in issue #1493 and there are some workarounds if you really need a multi-level index.
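One possible workaround (a sketch, not taken from the issue thread): compute to pandas at the end, assuming the final result fits in memory at that point, or otherwise stay in dask with a single combined key column instead of a true multi-level index. The names pdf_large, ddf_keyed and id1_id2 are chosen here for illustration.
# compute to pandas and set the multi-level index there
pdf_large = ddf.compute().reset_index().set_index(['id1', 'id2'])

# alternatively, stay in dask with a single combined key column
ddf_keyed = ddf.reset_index()
ddf_keyed['id1_id2'] = ddf_keyed['id1'].astype(str) + '_' + ddf_keyed['id2'].astype(str)
ddf_keyed = ddf_keyed.set_index('id1_id2')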
EDIT:
If the original data df is really large, as in already maxing out your memory, it might be necessary to convert it to a .csv or other pandas-readable format and build the reading into the scale function, i.e.:
def scale_my_df(df_filepath, scale_factor, id_num):
    '''
    Scales and returns a DataFrame.
    '''
    df_init = pd.read_csv(df_filepath)
    df_scaled = df_init * scale_factor
    df_scaled['id2'] = id_num
    return df_scaled
And adjust the rest of the code accordingly. The idea of dask is to keep the data out of memory, but there is some overhead involved with building the computational graph and holding intermediate values.
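For instance, the delayed list comprehension might then look like the sketch below, assuming the original frame has first been written to a hypothetical file df_init.csv:
df.to_csv('df_init.csv')  # hypothetical path, written once up front
dfs_delayed = [delayed(scale_my_df)(df_filepath='df_init.csv', scale_factor=factor, id_num=i)
               for i, factor in enumerate(factors)]
ddf = dd.from_delayed(dfs_delayed)
# note: passing index_col='id1' to pd.read_csv inside scale_my_df keeps the
# index column from being scaled along with the data columns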
I'm trying to replicate the pandas group-by rolling mean logic below in dask, but I'm stuck on 1) how to specify the time period in days and 2) how to assign the result back into the original frame.
df['avg3d']=df.groupby('g')['v'].transform(lambda x: x.rolling('3D').mean())
I get errors like:
ValueError: index must be monotonic, ValueError: Not all divisions are known, can't align partitions
or ValueError: cannot reindex from a duplicate axis
Full example
import pandas as pd
import dask.dataframe
df1 = pd.DataFrame({'g':['a']*10,'v':range(10)},index=pd.date_range('2020-01-01',periods=10))
df2=df1.copy()
df2['g']='b'
df = pd.concat([df1,df2]).sort_index()
df['avg3d']=df.groupby('g')['v'].transform(lambda x: x.rolling('3D').mean())
ddf = dask.dataframe.from_pandas(df, npartitions=4)
# works
ddf.groupby('g')['v'].apply(lambda x: x.rolling(3).mean(), meta=('avg3d', 'f8')).compute()
# rolling time period fails
ddf.groupby('g')['v'].apply(lambda x: x.rolling('3D').mean(), meta=('avg3d', 'f8')).compute()
# how do I add it to the rest of the data??
# neither of these work
ddf['avg3d']=ddf.groupby('g')['v'].apply(lambda x: x.rolling('3D').mean(), meta=('x', 'f8'))
ddf['avg3d']=ddf.groupby('g')['v'].transform(lambda x: x.rolling(3).mean(), meta=('x', 'f8'))
ddft = ddf.merge(ddf3d)
ddf.assign(avg3d=ddf.groupby('g')['v'].transform(lambda x: x.rolling(3).mean(), meta=('x', 'f8')))
Looked at
dask groupby apply then merge back to dataframe
Dask rolling function by group syntax
Compute the rolling mean over the last n days in Dask
ValueError: Not all divisions are known, can't align partitions error on dask dataframe
This problem arises due to the current implementation of .groupby in dask. The answer below is not a complete solution, but will hopefully explain why the error is happening.
First, let's make sure we get a true_result against which we can compare the dask results:
import dask.dataframe
import pandas as pd
df1 = pd.DataFrame(
    {"g": ["a"] * 10, "v": range(10)}, index=pd.date_range("2020-01-01", periods=10)
)
df = pd.concat([df1, df1.assign(g="b")]).sort_index()
df["avg3d"] = df.groupby("g")["v"].transform(lambda x: x.rolling("3D").mean())
true_result = df["avg3d"].array
Now, running the code that is commented with # works is going to generate different values every time, even though neither the data nor the computations have any source of randomness:
ddf = dask.dataframe.from_pandas(df, npartitions=4)
# this doesn't work
dask_result_1 = ddf.groupby("g")["v"].apply(
    lambda x: x.rolling(3).mean(), meta=("avg3d", "f8")
).compute().array
# this will fail, every time for a different reason
assert all(dask_result_1 == true_result)
Why is this happening? Well, under the hood, dask will want to shuffle data around to make sure that all the values of the groupby variable are in a single partition. This shuffling seems to be random, so when the values are stitched back together they can be out of the original order.
So a quick way to fix this is to add sorting before the rolling computation:
# rolling time period works
avg3d_dask = (
    ddf.groupby("g")["v"]
    .apply(lambda x: x.sort_index().rolling("3D").mean(), meta=("avg3d", "f8"))
    .compute()
    .droplevel(0)
    .sort_index()
)
# this will always pass
assert all(avg3d_dask == true_result)
Now, how do we add this to the original dataframe? I don't know a simple way of doing this, but one of the hard ways would be to calculate the partitions of the original dask dataframe and then split the data into appropriate chunks and assign. This approach, however, is not very robust (or at least requires a lot of use-case-specific fine-tuning), so hopefully someone can provide a better solution for this part.
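That said, here is a minimal sketch of one workaround, relying on the assertion above that avg3d_dask comes back in the same row order as the original frame: assign it back in pandas and re-create the dask dataframe. This only helps if df itself fits in memory, so it is a workaround rather than a general solution. The names df_with_avg and ddf_with_avg are chosen here for illustration.
df_with_avg = df.assign(avg3d=avg3d_dask.values)
ddf_with_avg = dask.dataframe.from_pandas(df_with_avg, npartitions=4)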
I currently have a CSV with many rows (some 200k) and many columns. I basically want a time-series train/test split. I have many unique items in my dataset, and I want the first 80% (chronologically) of each to be in the training data. I wrote the following code to do so:
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date', ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I adding them incorrectly?
Your immediate issue is that you are not re-assigning inside the for-loop (DataFrame.append returns a new frame rather than modifying in place):
train = train.append(trainvals)
test = test.append(testvals)
However, it is memory-inefficient to grow large objects like data frames in a loop. Instead, consider iterating over a groupby to build a list of dictionaries containing the train and test splits via a list comprehension, then call pd.concat to bind each set together. A defined function keeps the processing organized:
def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]
    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(sub_df) for _, sub_df in df.groupby('itemid')]

train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
You can avoid the loop with df.groupby.quantile.
train = df.groupby('itemid').quantile(0.8)
test = df.loc[~df.index.isin(train.index), :] # all rows not in train
Note this could have unexpected behavior if df.index is not unique.
What would be the fastest way to convert Redis Stream output (aioredis client / hiredis parser) to a pandas DataFrame, where the Redis Stream ID's timestamp and sequence number, as well as the values, become properly type-converted pandas index columns?
Example Redis output:
[[b'1554900384437-0', [b'key', b'1']],
[b'1554900414434-0', [b'key', b'1']]]
There seem to be two main bottlenecks here:
Pandas DataFrames store their data in column-major format, meaning each column maps to one numpy array, whereas the Redis stream data is row-by-row.
Pandas MultiIndex is made for categorical data, and converting raw arrays to the required levels/codes structure seems to be non-optimized.
Due to number 1. it is inevitable to loop over all Redis stream entries. Assuming we know the length beforehand, we can pre-allocate numpy arrays that we fill as we go along, and with some tricks reuse these arrays as the DataFrame columns. If the overhead of looping in Python is still too much, rewriting in Cython should be straightforward.
Since you didn't specify datatypes, the answer keeps everything in bytes using object-dtype numpy arrays; it should be reasonably obvious how to adapt this to a custom setting. The only reason to put all of the columns in the same array is to move an inner loop over the columns/fields from Python to C. It could be split up into, e.g., one array per data type or one array per column.
from functools import partial, reduce
import numpy as np
import pandas as pd

data = [[b'1554900384437-0', [b'foo', b'1', b'bar', b'2', b'bla', b'abc']],
        [b'1554900414434-0', [b'foo', b'3', b'bar', b'4', b'bla', b'xyz']]]

colnames = data[0][1][0::2]
ncols = len(colnames)
nrows = len(data)
ts_seq = np.empty((2, nrows), dtype=np.int64)
cols = np.empty((ncols, nrows), dtype=object)  # object dtype keeps the raw bytes

for i, (entry_id, fields) in enumerate(data):
    ts, seq = entry_id.split(b"-", 2)
    ts_seq[:, i] = (int(ts), int(seq))
    cols[:, i] = fields[1::2]

colframes = [pd.DataFrame(cols[i:i+1, :].T) for i in range(ncols)]
merge = partial(pd.merge, left_index=True, right_index=True, copy=False)
df = reduce(merge, colframes[1:], colframes[0])
df.columns = colnames
For number 2. we can use numpy.unique to create the levels/codes structure needed by Pandas MultiIndex. From the documentation it seems that numpy.unique also sorts the data. Since our data is presumably already sorted, a possible future optimisation would be to try to skip the sorting step.
ts = ts_seq[0, :]
seq = ts_seq[1, :]
maxseq = np.max(seq)
ts_levels, ts_codes = np.unique(ts, return_inverse=True)
seq_levels = np.arange(maxseq+1)
seq_codes = seq
df.index = pd.MultiIndex(levels=[ts_levels, seq_levels], codes=[ts_codes, seq_codes], names=["Timestamp", "Seq"])
Finally, we can verify that there was no copying involved by doing
cols[0, 0] = b'79'
and checking that the entries in df do indeed change.
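For example, one quick check could look like this (a tiny sketch):
print(df.iloc[0, 0])  # expected to print b'79' if no copy was made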
The quickest way is to process the data in batches:
Do IO in batches of N messages (e.g. 100 messages per batch).
Convert each batch into one DataFrame (using pd.DataFrame([...])).
Apply a lambda or conversion function to the timestamp column converted to numpy (.values), e.g.:
from datetime import datetime
# assumes the stream IDs come back as bytes, with a millisecond timestamp before the '-'
df['time'] = [datetime.fromtimestamp(int(t.split(b'-')[0]) / 1000) for t in df['time'].values]
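Below is a sketch of that batched flow; batch_to_frame and the shape of batch follow the example output in the question and are assumptions, not part of any real client API.
import pandas as pd
from datetime import datetime

def batch_to_frame(batch):
    # batch looks like: [[b'1554900384437-0', [b'key', b'1']], ...]
    rows = []
    for entry_id, fields in batch:
        row = dict(zip(fields[0::2], fields[1::2]))
        row[b'time'] = entry_id
        rows.append(row)
    df = pd.DataFrame(rows)
    # the part of the stream ID before '-' is a millisecond timestamp
    df[b'time'] = [datetime.fromtimestamp(int(t.split(b'-')[0]) / 1000)
                   for t in df[b'time'].values]
    return df

df = batch_to_frame([[b'1554900384437-0', [b'key', b'1']],
                     [b'1554900414434-0', [b'key', b'1']]])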
you can use this:
pd.read_msgpack(redisConn.get("key"))
This question follows on from the linked solution for applying a lambda function to a dask dataframe; I need a solution that does not require a pandas dataframe to implement. The reason is that I have a larger-than-memory dataframe, and loading it into memory will not work as it does in pandas (pandas is really good if the data fits in memory).
The solution to the linked question is below.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})  # How to read this sort of format directly into a dask dataframe?
ddf = dd.from_pandas(df, npartitions=2)  # dask conversion
list1 = ['A', 'B', 'C']  # list1 of header names
for c in list1:
    vc = ddf[c].value_counts().compute()
    vc /= vc.sum()
    print(vc)  # a table with the proportion of unique values
    for i in range(vc.count()):
        if vc[i] < 0.5:  # checks whether the variable value has a proportion of less than .5
            ddf[c] = ddf[c].where(ddf[c] != vc.index[i], 'others')  # changes such variable values to 'others' (iterates through all columns mentioned in list1)
    print(ddf.compute())  # shows how changes have been implemented column by column
However, the second for loop takes a very long time to compute on the actual (larger-than-memory) dataframe. Is there a more efficient way of getting the same output using dask?
The objective of the code is to change a column's value to 'others' for labels that appear less than 50% of the time in that column. For example, if the value ant appears less than 50% of the time in a column, then change it to 'others'.
Would anyone be able to help me in this regard?
Thanks
Michael
Here is a way to skip your nested loop:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

l = len(ddf)
for col in ddf.columns:
    vc = ddf[col].value_counts() / l
    vc = vc[vc > .5].index.compute()
    ddf[col] = ddf[col].where(ddf[col].isin(vc), "other")

ddf = ddf.compute()
If you have a really big dataframe and it is in parquet format, you can try to read it column by column, save the result of each column to a different file, and at the end concatenate them horizontally, as in the sketch below.
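A rough sketch of that idea, assuming a hypothetical parquet dataset at 'data.parquet' with columns A, B and C; the output paths are also chosen here for illustration.
import dask.dataframe as dd

cols = ['A', 'B', 'C']
l = None
for col in cols:
    ddf_col = dd.read_parquet('data.parquet', columns=[col])
    if l is None:
        l = len(ddf_col)  # row count, computed once
    vc = ddf_col[col].value_counts() / l
    keep = vc[vc > .5].index.compute()
    ddf_col[col] = ddf_col[col].where(ddf_col[col].isin(keep), 'other')
    ddf_col.to_parquet('out_' + col + '.parquet')  # one output per column

# read the per-column results back and concatenate horizontally;
# dask will warn if divisions are unknown and assume partition-wise alignment
parts = [dd.read_parquet('out_' + col + '.parquet') for col in cols]
result = dd.concat(parts, axis=1)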
I have this code which generates autoregressive terms within each unique combination of variables 'grouping A' and 'grouping B'.
for i in range(1, 5):
    df.loc[:, 'var_' + str(i)] = df.sort_values(by='date') \
        .groupby(['grouping A', 'grouping B']) \
        ['target'].sum().shift(i).ffill().bfill().values
Is it possible to sort values, group, shift, and then assign to a new variable without computing in Dask?
Dask.delayed
So if you want to just parallelize the for loop you might do the following with dask.delayed
import dask

ddf = dask.delayed(df)
results = []
for i in range(1, 5):
    result = ddf.sort_values(by='date') \
        .groupby(['grouping A', 'grouping B']) \
        ['target'].sum().shift(i).ffill().bfill().values
    results.append(result)

results = dask.compute(*results)

for i, result in enumerate(results, start=1):
    df[...] = result  # mutate dataframe as you like
That is, we wrap the dataframe in dask.delayed. Any method call on it will be lazy. We collect all of these lazy method calls and then execute them together with dask.compute. We don't want to mutate the dataframe during this period (that would be weird), so we do it afterwards.
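Concretely, the placeholder mutation above might look like this, reusing the column names from the question (a sketch):
for i, result in enumerate(results, start=1):
    df['var_' + str(i)] = result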
Large dataframe
If you want to do this with a large dataframe then you would probably want to use dask.dataframe instead. This will be less straightforward, but will hopefully work decently well. You should really look out for the sort_values operation. Distributed sorting is a very hard problem and very expensive. You want to minimize this if possible.
import dask.dataframe as dd

ddf = ...  # load a distributed dataframe with dd.read_csv, dd.read_parquet, etc.
ddf = ddf.set_index('date').persist()

results = []
for i in range(1, 5):
    result = ddf.groupby(['grouping A', 'grouping B']) \
        ['target'].sum().shift(i).ffill().bfill()
    results.append(result)

ddf2 = dd.concat([ddf] + results, axis=1)
Here we use set_index rather than sort_values, and we make sure to do it exactly once (it's likely to take 10-100x longer than any other operation here). We then use the normal groupby etc. syntax and things should be fine (although I have to admit I haven't verified that ffill and bfill are definitely implemented; I assume so though). As before, we don't want to mutate our data during computation (that would be weird), so we do a concat afterwards.
Maybe simpler
Probably you'll get a greatly reduced dataframe after the groupby-sum. Use dask.dataframe for this and then ditch Dask and head back to the comfort of Pandas:
ddf = ...  # load a distributed dataframe with dd.read_csv, dd.read_parquet, etc.
pdf = ddf.groupby(['grouping A', 'grouping B']).target.sum().compute()
# ... do whatever you want with a much smaller pandas dataframe ...