Convert Dask Bag of Pandas DataFrames to a single Dask DataFrame - python

Summary of Problem
Short Version
How do I go from a Dask Bag of Pandas DataFrames, to a single Dask DataFrame?
Long Version
I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or dd.read_parquet). I do have my own function that reads them in as Pandas DataFrames (the function only works on one file at a time, akin to pd.read_csv). I would like to combine all of these single Pandas DataFrames into one large Dask DataFrame.
Minimum Working Example
Here's some example CSV data (my data isn't actually in CSVs, but I'm using it here for ease of example). To create a minimum working example, you can save this as a CSV and make a few copies, then use the code below.
"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db
files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
What I've tried so far
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)
df = bag.map(lambda x: x.to_records()).to_dataframe() # this doesn't work
df = bag.map(lambda x: x.to_dict(orient = <any option>)).to_dataframe() # neither does this
# This gets me really close. It's a bag of Dask DataFrames.
# But I can't figure out how to concatenate them together
df = bag.map(dd.from_pandas, npartitions = 1)
df = dd.from_delayed(bag) # returns an error

I recommend using dask.delayed with dask.dataframe. There is a good example doing what you want to do here:
https://docs.dask.org/en/latest/delayed-collections.html
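A minimal sketch of that delayed-based approach, assuming the list_of_files and my_reader_function names from the question:
import dask
import dask.dataframe as dd
# Wrap each per-file read in dask.delayed, then build one Dask DataFrame from them
delayed_dfs = [dask.delayed(my_reader_function)(f) for f in list_of_files]
df = dd.from_delayed(delayed_dfs)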

Here are two additional possible solutions:
1. Convert the bag to a list of dataframes, then use dd.multi.concat:
bag  # a dask bag of pandas dataframes
list_of_dfs = bag.compute()
df = dd.multi.concat(list_of_dfs).compute()
2. Convert to a bag of dictionaries and use bag.to_dataframe:
bag_of_dicts = bag.map(lambda df: df.to_dict(orient='records')).flatten()
df = bag_of_dicts.to_dataframe().compute()
In my own specific use case, option #2 had better performance than option #1.

If you already have a bag of dataframes then you can do the following:
1. Convert the bag to delayed partitions,
2. convert the delayed partitions to delayeds of dataframes by concatenating,
3. create a dataframe from these delayeds.
In python code:
import dask
import dask.dataframe
import pandas

def bag_to_dataframe(bag, **concat_kwargs):
    # Each delayed partition holds a list of pandas dataframes; concatenate each into one
    partitions = bag.to_delayed()
    concat_partition = dask.delayed(
        lambda partition: pandas.concat(partition, **concat_kwargs)
    )
    dataframes = [concat_partition(partition) for partition in partitions]
    return dask.dataframe.from_delayed(dataframes)
You might want to control the concatenation of partitions, for example to ignore the index.
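For example, a minimal usage sketch (assuming the bag of pandas DataFrames built in the question), ignoring each file's original index:
# Build a single Dask DataFrame from the bag, discarding each file's own index
df = bag_to_dataframe(bag, ignore_index=True)
print(df.head())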

Related

Merge several .csv into one csv in python

Good evening,
So I have a huge number of .csvs which I either want to combine into one giant csv before reading it with pandas, or to read directly into a df containing all the .csvs. The .csvs all have two columns, "timestamp" and "holdings". Now I want to merge them on the "timestamp" column where they match, and create a new column for each "holdings" column. So far I have produced this:
import os
import glob
import pandas as pd
os.chdir("C/USer....")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
       for f in os.listdir(os.getcwd()) if f.endswith('csv')]
The output is a list of dfs. How do I merge them on the "timestamp" column now? I tried to concatenate and merge already, but it always puts them into a single column.
What you are looking for is an outer join between the dataframes. Since the pandas merge function only operates on two dataframes at a time, we need to loop over the dataframes and merge them individually. We can use the reduce function from functools to do this cleanly in one line:
import pandas as pd
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['timestamp'],
                                                how='outer'), dfs)
Use the suffixes argument in the merge function to clean up your column headings.
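For example, a minimal sketch of suffixes when merging a single pair by hand (the suffix values here are illustrative, not from the original answer):
# Illustrative suffixes keep the repeated "holdings" columns distinguishable
merged_pair = pd.merge(dfs[0], dfs[1], on=['timestamp'], how='outer',
                       suffixes=('_file1', '_file2'))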

What is the fastest way to make a DataFrame from a list?

So basically I'm trying to transform a list into a DataFrame.
Here are the two ways I am trying to do it, but I cannot come up with a good performance benchmark.
import pandas as pd
mylist = [1,2,3,4,5,6]
names = ["name","name","name","name","name","name"]
# Way 1
pd.DataFrame([mylist], columns=names)
# Way 2
pd.DataFrame.from_records([mylist], columns=names)
I also tried dask but I did not find anything that could work for me.
So I just made up an example with 10 columns of 1 million random values each, and I got the maximum result very quickly. Does this maybe give you a start to work with dask? They proposed the approach here, which is also related to this question.
import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd
import numpy as np
# Create a list of arrays of random values
list_large = [np.random.random_sample(int(1e6))*i for i in range(10)]
# Convert it to dask dataframe
dfs = [delayed(pd.DataFrame)(i) for i in list_large]
df = dd.from_delayed(dfs)
# Calculate the maximum of each column
max_values = df.max().compute()
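For the original two-constructor comparison, a minimal timing sketch (timeit and the repeat count are illustrative choices, not from the answer):
import timeit
import pandas as pd
mylist = [1, 2, 3, 4, 5, 6]
names = ["name", "name", "name", "name", "name", "name"]
# Time each constructor on the same tiny input; differences are small at this scale
t1 = timeit.timeit(lambda: pd.DataFrame([mylist], columns=names), number=10000)
t2 = timeit.timeit(lambda: pd.DataFrame.from_records([mylist], columns=names), number=10000)
print("pd.DataFrame:             ", t1)
print("pd.DataFrame.from_records:", t2)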

How to make multiple json processing faster using pySpark?

I have a list of json files in Databricks, and what I am trying to do is to read each json, extract the values needed, and append them to an empty pandas dataframe. Each json file corresponds to one row of the final dataframe. The initial json file list is 50k long. What I have built so far is the function below, which does the job, but it takes so much time that I have to subset the json file list into 5k bins and run each one separately, which takes 30 minutes each. I am limited to a 3-node cluster in Databricks.
Any chance that you could improve the efficiency of my function? Thanks in advance.
### Create a big dataframe including all json files ###
def jsons_to_pdf(all_paths):
    # Create an empty pandas dataframe (it is defined only with column names)
    pdf = create_initial_pdf(samplefile)
    # Append each row into the above dataframe
    for path in all_paths:
        # Create a spark dataframe
        sdf = sqlContext.read.json(path)
        # Create two extracted lists of values
        init_values = sdf.select("id", "logTimestamp", "otherTimestamp").rdd.flatMap(lambda x: x).collect()
        id_values = sdf.select(sdf["dataPoints"]["value"]).rdd.flatMap(lambda x: x).collect()[0]
        # Append the concatenated list as a row into the initial dataframe
        pdf.loc[len(pdf)] = init_values + id_values
    return pdf
Each json file contains a few top-level fields (the ones selected in the function above are "id", "logTimestamp" and "otherTimestamp") plus a "dataPoints" array of id/value entries.
What I want to achieve is to have each dataPoints id as a new column and the corresponding dataPoints value as its value, ending up with one row per file.
According to your example, what you want to perform is a pivot and then transform your data into a pandas dataframe.
The steps are:
1. Collect all your jsons into one big Spark dataframe,
2. pivot your data,
3. transform it into a pandas dataframe.
Try something like this:
from functools import reduce
from pyspark.sql import functions as F  # needed for F.explode below

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons
    sdf = reduce(
        lambda a, b: a.union(b),
        [sqlContext.read.json(path) for path in all_paths]
    )
    # Select and pivot your data
    pivot_df = sdf.select(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
        F.explode("datapoints").alias("datapoint")
    ).groupBy(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
    ).pivot(
        "datapoint.id"
    ).sum("datapoint.value")
    # Convert to a pandas dataframe
    pdf = pivot_df.toPandas()
    return pdf
According to your comment, you can replace the list of files all_paths with a generic path and change the way you create sdf:
all_paths = 'abc/*/*/*'  # 3x *, one for year, one for month, one for day

def jsons_to_pdf(all_paths):
    # Create one big dataframe from all the jsons in a single read
    sdf = sqlContext.read.json(all_paths)
This will surely improve performance.

How to apply function to single Column of large dataset using Dask?

If I want to apply a function to calculate the logarithm of a single column of a large dataset using Dask, how can I do that?
df_train.apply(lambda x: np.log1p(x), axis=1 , meta={'column_name':'float32'}).compute()
The dataset is very large (125 million rows). How can I do that?
You have a few options:
Use dask.array functions
Just like how your pandas dataframe can use numpy functions
import numpy as np
result = np.log1p(df.x)
Dask dataframes can use dask array functions
import dask.array as da
result = da.log1p(df.x)
Map Partitions
But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function that you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe.
Pandas
result = f(df.x)
Dask DataFrame
result = df.x.map_partitions(f)
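For the question's log1p case, a minimal runnable sketch (the small df_train stand-in and the column name 'x' are illustrative):
import numpy as np
import pandas as pd
import dask.dataframe as dd
# Small stand-in for the large df_train from the question
df_train = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)
# Apply log1p to a single column, partition by partition
result = df_train['x'].map_partitions(np.log1p, meta=('x', 'float64'))
print(result.compute())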
Map
You can always use the map or apply(axis=0) methods, but just like in Pandas these are usually very bad for performance.

Pandas Get a List Of All Data Frames loaded into memory

I am using pandas to read several csv files into memory for processing and at some point would like to list all the data frames I have loaded into memory. Is there a simple way to do that? (I am thinking something like %ls but only for the data frames that I have available in memory)
I personally think this approach is much better (if in IPython).
import pandas as pd
%whos DataFrame
You could list all dataframes with the following:
import pandas as pd
# create dummy dataframes
df1 = pd.DataFrame({'Col1' : list(range(100))})
df2 = pd.DataFrame({'Col1' : list(range(100))})
# Check whether all variables in scope are pandas dataframes.
# dir() will return a list of string representations of the variables.
# Simply evaluate each and test whether it is a pandas dataframe
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs) # df1, df2
building on previous answers ...
this returns a list
import pandas as pd
%who_ls DataFrame
However, this does not work if you run it from a script, thus:
import pandas as pd
sheets = []
for var in dir():
    if isinstance(locals()[var], pd.core.frame.DataFrame) and var[0] != '_':
        sheets.append(var)
since some DataFrames will have a copy for internal use only, and those start with '_'.
In case you want to have all the dataframes in an iterable list (for example because you want to concatenate them all, their number will grow, or their names are going to change), this is the way:
# Output all the dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
# Create an iterable list of the dataframes
list_of_dfs = []
for df in alldfs:
    list_of_dfs.append(locals()[df])
In case you have multiple dataframes and there are some which you do not want to concatenate or run other operations on, you can put their names in a small list, filter them, and choose the desired ones.
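For example, a minimal sketch of that filtering step, reusing alldfs from above (the name prefix used here is purely illustrative):
# Keep only the dataframes whose variable names start with a chosen prefix
wanted_names = [name for name in alldfs if name.startswith('df')]
wanted_dfs = []
for name in wanted_names:
    wanted_dfs.append(locals()[name])
combined = pd.concat(wanted_dfs, ignore_index=True)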
