I want to apply a function that computes the logarithm of a single column of a large dataset using Dask. This is what I tried:
df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()
The dataset is very large (125 million rows). How can I do that?
You have a few options:
Use dask.array functions
Just like your pandas dataframes can use numpy functions:
import numpy as np
result = np.log1p(df.x)
Dask dataframes can use dask array functions
import dask.array as da
result = da.log1p(df.x)
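Putting the dask.array option together, here is a minimal sketch; the file pattern and the column name x are illustrative assumptions, not from the original question:
import dask.array as da
import dask.dataframe as dd

# Hypothetical CSV paths; any dask dataframe with a numeric column 'x' works here
df = dd.read_csv('/path/to/data-*.csv')
result = da.log1p(df.x)    # lazy: only builds the task graph
values = result.compute()  # triggers the actual computation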
Map Partitions
But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function that you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe:
Pandas
result = f(df.x)
Dask DataFrame
result = df.x.map_partitions(f)
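For the log1p case from the question, a hedged sketch of the map_partitions route; the meta hint is optional but saves dask from having to guess the output type:
import numpy as np

# Apply np.log1p to the pandas Series backing each partition of df.x
result = df.x.map_partitions(np.log1p, meta=('x', 'float64'))
values = result.compute()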
Map
You can always use the map or apply(axis=0) methods, but just like in Pandas these are usually very bad for performance.
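For completeness, a minimal sketch of the element-wise route, assuming the same column df.x; it works, but is usually much slower than the vectorized options above:
result = df.x.map(np.log1p)  # element-wise map over the column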
Related
I am trying to winsorize a data set that would contain a few hundred columns of data. I'd like to add a new column to the dataframe that contains the winsorized result of its row's data. How can I do this with a pandas dataframe without having to specify each column (I'd like to use all columns)?
Edit: I want to use the function winsorize(list, limits=[0.1, 0.1]), but I'm not sure how to format the dataframe rows to work as a list.
Some tips:
You may use the pandas function apply with axis=1 to apply a function to every row.
The apply function will receive a pandas Series object, but you can easily convert it to a list using the tolist method.
For example:
df.apply(lambda x: winsorize(x.tolist(), limits=[0.1,0.1]), axis=1)
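If you want the winsorized rows back as a dataframe with the original column names, one possible variant (a sketch, not part of the original answer) wraps each row's result in a pandas Series so that apply expands it into columns:
import pandas as pd
from scipy.stats.mstats import winsorize

# Illustrative 3x10 frame; the real dataframe has a few hundred columns
df = pd.DataFrame([range(10), range(10, 20), range(20, 30)])
winsorized = df.apply(
    lambda row: pd.Series(winsorize(row.tolist(), limits=[0.1, 0.1]), index=row.index),
    axis=1,
)  # same shape and column names as df, with each row winsorized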
Alternatively, you can work on the numpy version of your dataframe, obtained with to_numpy():
import pandas as pd
from scipy.stats.mstats import winsorize

# Winsorize every row at once (axis=1), then rebuild the dataframe
ma = winsorize(df.to_numpy(), axis=1, limits=[0.1, 0.1])
out = pd.DataFrame(ma.data, index=df.index, columns=df.columns)
Summary of Problem
Short Version
How do I go from a Dask Bag of Pandas DataFrames, to a single Dask DataFrame?
Long Version
I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or dd.read_parquet). I do have my own function that will read them in as Pandas DataFrames (function only works on one file at a time, akin to pd.read_csv). I would like to have all of these single Pandas DataFrames in one large Dask DataFrame.
Minimum Working Example
Here's some example CSV data (my data isn't actually in CSVs, but I'm using CSV here for ease of example). To create a minimum working example, you can save this as a CSV, make a few copies, and then use the code below.
"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db
files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
What I've tried so far
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)
df = bag.map(lambda x: x.to_records()).to_dataframe() # this doesn't work
df = bag.map(lambda x: x.to_dict(orient = <any option>)).to_dataframe() # neither does this
# This gets me really close. It's a bag of Dask DataFrames.
# But I can't figure out how to concatenate them together
df = bag.map(dd.from_pandas, npartitions = 1)
df = dd.from_delayed(bag) # returns an error
I recommend using dask.delayed with dask.dataframe. There is a good example doing what you want to do here:
https://docs.dask.org/en/latest/delayed-collections.html
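For reference, a minimal sketch of that approach, reusing the my_reader_function and list_of_files names from the question:
import dask
import dask.dataframe as dd

# One delayed pandas dataframe per file; nothing is read yet
delayed_dfs = [dask.delayed(my_reader_function)(path) for path in list_of_files]
# Build a single dask dataframe from the delayed pieces
df = dd.from_delayed(delayed_dfs)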
Here are two additional possible solutions:
1. Convert the bag to a list of dataframes, then use dd.multi.concat:
bag  # a dask bag of dataframes
list_of_dfs = bag.compute()
df = dd.multi.concat(list_of_dfs).compute()
2. Convert to a bag of dictionaries and use bag.to_dataframe:
bag_of_dicts = bag.map(lambda df: df.to_dict(orient='records')).flatten()
df = bag_of_dicts.to_dataframe().compute()
In my own specific use case, option #2 had better performance than option #1.
If you already have a bag of dataframes, then you can do the following:
Convert the bag to delayed partitions,
convert the delayed partitions into delayed dataframes by concatenating,
create a dask dataframe from these delayeds.
In Python code:
import dask
import dask.dataframe
import pandas

def bag_to_dataframe(bag, **concat_kwargs):
    # One delayed object per bag partition; each evaluates to a list of pandas dataframes
    partitions = bag.to_delayed()
    # Concatenate the dataframes within each partition into a single pandas dataframe
    dataframes = map(
        dask.delayed(lambda partition: pandas.concat(partition, **concat_kwargs)),
        partitions,
    )
    return dask.dataframe.from_delayed(dataframes)
You might want to control the concatenation of partitions, for example to ignore the index.
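For example, to forward pandas.concat's ignore_index option (a usage sketch with the bag from the question):
ddf = bag_to_dataframe(bag, ignore_index=True)  # each keyword is passed on to pandas.concat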
I have a Pandas DataFrame (dataset, 889x4) and a Numpy ndarray (targets_one_hot, 889x29), which I want to concatenate. Therefore, I want to convert targets_one_hot into a Pandas DataFrame.
To do so, I looked at several suggestions. However, these suggestions are about smaller arrays, for which it is okay to write out the different columns.
For 29 columns, this seems inefficient. Can anyone tell me an efficient way to turn this Numpy array into a Pandas DataFrame?
We can wrap a numpy array in a pandas dataframe by passing it as the first parameter. Then we can make use of pd.concat(..) [pandas-doc] to concatenate the original dataset and the dataframe of targets_one_hot into a new dataframe. Since we concatenate column-wise here (placing the new columns next to the original ones), we need to set the axis parameter to axis=1:
pd.concat((dataset, pd.DataFrame(targets_one_hot)), axis=1)
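A slightly fuller sketch of the same idea; reusing dataset's index guards against row misalignment in case dataset does not carry a default RangeIndex (the variable names come from the question):
import pandas as pd

# Wrap the one-hot array, aligned on the same index as the original frame
targets_df = pd.DataFrame(targets_one_hot, index=dataset.index)
# Column-wise concatenation: 889 rows x (4 + 29) columns
combined = pd.concat([dataset, targets_df], axis=1)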
I have to write an object that takes either a pandas data frame or a numpy array as input (similar to sklearn behavior). In one of the methods of this object, I need to select columns (not a particular fixed one; I get a few column indices based on other calculations).
So, to make my code compatible with both input types, I tried to find a common way to select columns and tried methods like X[:, 0] (which doesn't work on pandas dataframes), X[0], and others, but they select differently. Is there a way to select columns in a similar fashion across pandas and numpy?
If no then how does sklearn work across these data structures?
You can use an if condition within your method and have separate selection logic for pandas dataframes and numpy arrays. Sample code is given below.
def method_1(self, var, col_indices):
    if isinstance(var, pd.DataFrame):
        # pandas: look up the column labels for the positional indices
        selected_columns = var[var.columns[col_indices]]
    else:
        # numpy: plain positional (fancy) indexing on the second axis
        selected_columns = var[:, col_indices]
    return selected_columns
Here, var is your input which can be a numpy array or pandas dataframe, col_indices are the indices of the columns you want to select.
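A standalone sketch of the same selection logic; the function name and the sample data below are illustrative, not from the question:
import pandas as pd

def select_columns(var, col_indices):
    # Same dispatch as above, outside of a class for easy testing
    if isinstance(var, pd.DataFrame):
        return var[var.columns[col_indices]]
    return var[:, col_indices]

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
col_indices = [0, 2]
print(select_columns(df, col_indices))             # columns 'a' and 'c'
print(select_columns(df.to_numpy(), col_indices))  # same values as a 2D array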
myfunc does some processing on a dataframe. I am trying to reduce computation time by vectorizing myfunc. Each dataframe is created by reading a very large text file (30 GB). I tried to create an array of dataframes and then vectorize myfunc so that it can be applied to the array of dataframes, but the problem is that np.vectorize applies the function to each cell of a dataframe, not to the whole dataframe. Even when I pass some columns of a dataframe as an array, np.vectorize applies myfunc to each cell inside the array, not to the whole array. I am not sure this is the right way to solve this problem. Please share your thoughts. Thank you.
import numpy as np
import pandas as pd
def myfunc(a):
    # Do some processing on the dataframe
    return a * 2
vecfunc = np.vectorize(myfunc)
x = pd.DataFrame(np.array([[1,2,3],[1,2,3]]))
y = pd.DataFrame(np.array([[1,2,3],[1,2,3]]))
z = pd.DataFrame(np.array([[1,2,3],[1,2,3]]))
result = vecfunc([x,y,z])
print(result)
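As an illustration of the behaviour described above (a sketch, not a full answer): np.vectorize converts the list of dataframes into a single 3-dimensional array and applies myfunc to every scalar, so the dataframe structure is lost, whereas calling the function once per dataframe keeps it intact:
print(type(result), result.shape)  # numpy.ndarray of shape (3, 2, 3) here; no dataframes left

# Calling myfunc once per dataframe preserves each dataframe
results = [myfunc(df) for df in (x, y, z)]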