Pandas Dataframe: to_dict() poor performance - python

I work with APIs that return large pandas DataFrames. I'm not aware of a fast way to iterate through the DataFrame directly, so I cast it to a dictionary with to_dict().
Once my data is in dictionary form, performance is fine; however, the to_dict() operation itself tends to be a performance bottleneck.
I often group columns of the DataFrame together to form a MultiIndex and use the 'index' orientation for to_dict(). I'm not sure whether the large MultiIndex drives the poor performance.
Is there a faster way to cast a pandas DataFrame? Is there perhaps a better way to iterate directly over the DataFrame without any cast, or some way I could apply vectorization?
Below is sample code that mimics the issue, with timings:
import pandas as pd
import random as rd
import time

# Given a dataframe from an api (modelled as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col: [rd.randint(0, 10) for x in range(0, 1000)] for col in df_columns}
df_origin = pd.DataFrame(dict_origin)

# Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(df_origin, values=df_columns[-3:], index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

# Iterate over all elements in the pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))

# Iteration over the dataframe is too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

# Iteration over the dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
Thank you!

The common guidance is: don't iterate; use functions that operate on whole rows/columns, or on grouped rows/columns. The third code block below shows how to iterate over the numpy array which is the DataFrame's .values attribute. The results are:
Pivot Construction takes: 0.012315988540649414
Dataframe iteration takes: 0.32346272468566895
Iteration over values takes: 0.004369020462036133
Cast to dictionary takes: 0.023524761199951172
Iteration over dictionary takes: 0.0010480880737304688
import pandas as pd
import random as rd
import time

# Test data: given a dataframe from an api (modelled as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col: [rd.randint(0, 10) for x in range(0, 1000)] for col in df_columns}
df_origin = pd.DataFrame(dict_origin)

# Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(df_origin, values=df_columns[-3:], index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

# Iterate over all elements in the pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))

# Iterate over all values in the pivot table via the underlying numpy array
t0 = time.time()
v = df_pivot.values
for row in range(df_pivot.shape[0]):
    for column in range(df_pivot.shape[1]):
        test = v[row, column]
t1 = time.time()
print('Iteration over values takes: ' + str(t1-t0))

# Iteration over the dataframe is too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

# Iteration over the dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
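For completeness, here is a sketch of one more alternative that is not in the original answer: pandas' itertuples(), which walks the rows as namedtuples and avoids both the per-cell .loc lookups and the dictionary cast. It assumes it runs before the to_dict() call, while df_pivot is still a DataFrame; timings will vary by machine.
# Sketch (not part of the original answer): iterate with itertuples before the cast
t0 = time.time()
for row in df_pivot.itertuples():
    for value in row[1:]:  # row[0] holds the MultiIndex entry
        test = value
t1 = time.time()
print('Iteration with itertuples takes: ' + str(t1-t0))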

Related

Parallelize python for loop on pandas dataframe and append the result

I have a pandas dataframe with 5M rows and 20+ columns. I want to do some calculations in a for loop, as in the sample below:
grp_list = df.GroupName.unique()
df2 = pd.DataFrame()
for g in grp_list:
    tmp_df = df.loc[df['GroupName'] == g]
    for i in range(len(tmp_df.GroupName)):
        # calls another function
        res = my_func(tmp_df)
    tmp_df['Result'] = res
    df2 = df2.append(tmp_df, ignore_index=True)
There are ~900 distinct GroupNames. To improve performance, I want to parallelize the first for loop, since it is independent for each GroupName, and append the results to an output dataframe. How can I do this effectively with multiprocessing, grouping by GroupName, with the final output being a single appended dataframe?
First, you can try:
out = []
for _, g in df.groupby("GroupName"):
    res = my_func(g)
    out.append(res)
final_df = pd.concat(out)
This alone should speed up your computation significantly: groupby makes a single pass over the dataframe instead of filtering the full frame once per group, and pd.concat builds the output once instead of appending repeatedly.
If you want to use multiprocessing (though whether it speeds things up depends on the computation inside my_func), you can use the next example:
import multiprocessing

def my_func(df):
    # modify df here
    # ...
    return df

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        groups = (g for _, g in df.groupby("GroupName"))
        out = []
        for res in pool.imap_unordered(my_func, groups):
            out.append(res)
        final_df = pd.concat(out)
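One caveat worth adding: multiprocessing has to pickle each group to ship it to a worker process, so if my_func is cheap relative to the size of each group, the serialization overhead can cancel out the parallel speedup. It's worth timing both versions on a subset of the data before committing to the pool.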

Why dask doesn't execute in parallel

Could someone point out what I did wrong with the following dask implementation? It doesn't seem to use multiple cores.
[Updated with reproducible code]
The code that uses dask:
import time
import numpy as np
import pandas as pd
import dask

bookingID = np.arange(1, 10000)
book_data = pd.DataFrame(np.random.rand(1000))

def calculate_feature_stats(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row

calculate_feature_stats = dask.delayed(calculate_feature_stats)

rows = []
for bookid in bookingID.tolist():
    row = calculate_feature_stats(bookid)
    rows.append(row)

start = time.time()
rows = dask.persist(*rows)
end = time.time()
print(end - start)  # Execution time = 16s on my machine
Code with the normal implementation, without dask:
bookingID = np.arange(1, 10000)
book_data = pd.DataFrame(np.random.rand(1000))

def calculate_feature_stats_normal(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row

rows = []
start = time.time()
for bookid in bookingID.tolist():
    row = calculate_feature_stats_normal(bookid)
    rows.append(row)
end = time.time()
print(end - start)  # Execution time = 4s on my machine
So the version without dask is actually faster; how is that possible?
Answer
Extended comment. Consider that dask adds roughly 1 ms of overhead per task (see the docs), so if each computation is shorter than that, dask isn't worth the trouble.
Coming to your specific question, I can think of two possible real-world scenarios:
1. A big dataframe with a column called bookingID and another value column
2. A different file for every bookingID
In the second case you can start from this answer, while for the first case you can proceed as follows:
import dask.dataframe as dd
import numpy as np
import pandas as pd

# create dummy df
df = []
for i in range(10_000):
    df.append(pd.DataFrame({"id": i,
                            "value": np.random.rand(1000)}))
df = pd.concat(df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.to_parquet("df.parq")
Pandas
%%time
df = pd.read_parquet("df.parq")
out = df.groupby("id").agg({"value": ["min", "max", "std", "mean"]})
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 1.65 s, sys: 316 ms, total: 1.96 s
Wall time: 1.08 s
Dask
%%time
df = dd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":["min", "max", "std", "mean"]}).compute()
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 4.94 s, sys: 427 ms, total: 5.36 s
Wall time: 3.94 s
Final thoughts
In this situation dask starts to make sense only if the df doesn't fit in memory, since dd.read_parquet can then process the data partition by partition instead of loading it all at once.

How to create a new dataframe by sorted data

I would like to find the rows which meet the condition RSI < 25.
However, the result is generated as one dataframe. Is it possible to create a separate dataframe for each matching row?
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='TSLA'
ck_df = wb.DataReader(stock,data_source='yahoo',start='2015-01-01')
rsi_period = 14
chg = ck_df['Close'].diff(1)
gain = chg.mask(chg<0,0)
ck_df['Gain'] = gain
loss = chg.mask(chg>0,0)
ck_df['Loss'] = loss
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
ck_df['Avg Gain'] = avg_gain
ck_df['Avg Loss'] = avg_loss
rs = abs(avg_gain/avg_loss)
rsi = 100-(100/(1+rs))
ck_df['RSI'] = rsi
RSIFactor = ck_df['RSI'] <25
ck_df[RSIFactor]
If you want to know at what index the RSI < 25 then just use:
ck_df[ck_df['RSI'] < 25].index
The result will also be a dataframe. If you insist on making a new one then:
new_df = ck_df[ck_df['RSI'] < 25].copy()
To split the rows found by #Omkar's solution into separate dataframes, you might use this function taken from here: Pandas: split dataframe into multiple dataframes by number of rows:
def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        dfs.append(df.iloc[start:count])
    return dfs
With this you get a list of dataframes.
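For example, to get one dataframe per matching row from the RSI question above, a usage sketch (n=1 and the variable names are the only assumptions here):
# One single-row dataframe per RSI < 25 hit
low_rsi = ck_df[ck_df['RSI'] < 25]
single_row_dfs = split_dataframe_to_chunks(low_rsi, 1)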

Spark DataFrame aggregation: windows + partitioning versus groupBy operations

I'm looking to perform about five different summarizing techniques over a lot of data. Generally, I'm looking to calculate mean, min, max, stddev, and sum over certain time windows and other dimensions.
Here's about as reproducible an example as I can make:
import random
import string
import datetime

from pyspark.sql import SparkSession, functions as func
from pyspark.conf import SparkConf
from pyspark.sql.types import StringType, DoubleType, IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()  # added so that sc below is defined
sc = spark.sparkContext

########## Setting up DataFrame ##########
def random_date(start, n):
    current = start
    for _ in range(n):
        current = current + datetime.timedelta(seconds=random.randrange(60))
        yield current

start_date = datetime.datetime(2013, 9, 20, 13, 00)
n_records = 50000000
dates = list(random_date(start_date, n_records))

other_data = []
for d in dates:
    categorical_data = tuple(random.choice(string.ascii_lowercase) for _ in range(1))
    numerical_data = tuple(random.randrange(100) for _ in range(20))
    other_data.append(categorical_data + numerical_data + (d,))

categorical_columns = ['cat_{}'.format(n) for n in range(1)]
numerical_columns = ['num_{}'.format(n) for n in range(20)]
date_column = ['date']
columns = categorical_columns + numerical_columns + date_column

df = sc.parallelize(other_data).toDF(columns)
df = df.withColumn('date_window', func.window('date', '5 minutes'))
df.registerTempTable('df')
########## End DataFrame setup ##########
To date, I've tried two techniques: one using the built-in DataFrame.groupBy mechanism; the other using pyspark.sql.window.Window's orderBy and partitionBy methods.
Generally, the pipeline I've developed looks like this:
For each numeric column, group-by the categorical columns cat_0 and date_window, and calculate the five summary statistics listed previously.
For the pyspark.sql.window.Window approach, join the calculated column directly, using df.withColumn. For the DataFrame.groupBy approach, track each result DataFrame (each with three columns: two for the grouping columns, and one for the calculated column) - at the end, join each DataFrame by basically performing a reduce operation with the grouping columns as keys.
I've left some of the pipeline code below, but I'm primarily interested in opinions about 1) whether either of these is a "best practice" for this type of work, and 2) if not, whether I'm missing something major within the Spark ecosystem that could help me do this much more quickly or with fewer resources.
Currently, the groupBy approach has performed a little better, but it's cumbersome to track each grouped-by DataFrame and reduce-join them all at the end. The Window approach hasn't performed as well, although it's syntactically a little cleaner and more maintainable, IMO. In either case, I have to allocate a massive amount of compute to get the job to run and write to disk at the end (without repartitioning/coalescing).
gb_cols = ['cat_0', 'date_window']
strategies = {'sum', 'mean', 'stddev', 'max', 'min'}
Xcols = [col for col in df.columns if col.startswith('num')]

# agg_window (a Window partitioned/ordered over the grouping columns) and
# window_cols are defined earlier in the pipeline and omitted from this excerpt.
for col in Xcols[:]:
    for s in strategies:
        new_col = '{}_{}'.format(col, s)
        Xcols.append(new_col)
        if s == 'mean':
            calc_col_series = func.mean(col)
        elif s == 'stddev':
            calc_col_series = func.stddev(col)
        elif s == 'max':
            calc_col_series = func.max(col)
        elif s == 'min':
            calc_col_series = func.min(col)
        elif s == 'sum':
            calc_col_series = func.sum(col)
        elif s == 'median':
            query = '''
                SELECT
                    PERCENTILE_APPROX({}, 0.5)
                FROM df
                GROUP BY {}
            '''.format(col, ','.join(gb_cols))
            calc_col_series = spark.sql(query)
        df = df.withColumn(new_col, calc_col_series.over(agg_window))

        # Differencing inputs
        for difference in range(1, 3 + 1):
            # Last period's datapoints... moved to the future
            led_series = func.lag(df[new_col], difference).over(agg_window.orderBy(window_cols['orderBy']))
            diff_series = df[new_col] - led_series
            new_col_diff = '{}_{}_diff'.format(new_col, difference)
            df = df.withColumn(new_col_diff, diff_series)
            Xcols.append(new_col_diff)
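For what it's worth, here is a minimal sketch, not from the question, of computing all five statistics for every numeric column in a single groupBy().agg() pass. It assumes the df and gb_cols defined above; the trailing join back onto df is only needed if per-row access to the summaries is required, and it avoids tracking one grouped DataFrame per column and the reduce-join at the end.
# Sketch: one aggregation pass instead of one windowed column at a time
agg_exprs = []
for col in [c for c in df.columns if c.startswith('num')]:
    for s, fn in [('sum', func.sum), ('mean', func.mean), ('stddev', func.stddev),
                  ('max', func.max), ('min', func.min)]:
        agg_exprs.append(fn(col).alias('{}_{}'.format(col, s)))
summary = df.groupBy(*gb_cols).agg(*agg_exprs)
df = df.join(summary, on=gb_cols, how='left')  # optional: attach summaries back to the rows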

Adding a specified value to each row in a pandas data frame

I am iterating over the rows that are available, but it doesn't seem to be an optimal way to do it -- it takes forever.
Is there a dedicated way in pandas to do this?
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")

# NEED TO ADD DATA FROM THAT COLUMN
df = pd.read_csv(dataset_path, delimiter=',', skiprows=range(0, 1),
                 names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME'])
df = df.drop('MULT', 1)
df = df.drop('EMPTY', 1)
df = df.drop('TSTAMP', 1)

for index, row in df.iterrows():
    TMP_TIME = INIT_TIME + datetime.timedelta(seconds=row['TCOUNT'])
    df['STAMPME'] = TMP_TIME.strftime("%s")
In addition, the datetime I am adding is in the following format
2017-05-11 11:12:37.100192 1494493957
2017-05-11 11:12:37.200541 1494493957
and therefore the unix timestamp is the same (which is correct), but is there a better way to represent it?
Assuming the datetimes are correctly reflecting what you're trying to do, with respect to Pandas you should be able to do:
df['STAMPME'] = df['TCOUNT'].apply(lambda x: (datetime.timedelta(seconds=x) + INIT_TIME).strftime("%s"))
As noted here, you should not use iterrows() to modify the DataFrame you are iterating over. If you need to iterate row by row (as opposed to using the apply method), you can use another data object, e.g. a list, to collect the values you're calculating, and then create a new column from that.
Also, for future reference, the itertuples() method is faster than iterrows(); it yields namedtuples, so values are accessed positionally (row[x]) or as attributes (row.TCOUNT) rather than with row['name'].
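If you do go the row-by-row route, here is a minimal sketch of that list-collecting pattern with itertuples() (the column name TCOUNT is from the question; everything else is an assumption):
# Collect values row by row, then assign the new column once at the end
stamps = []
for row in df.itertuples(index=False):
    tmp_time = INIT_TIME + datetime.timedelta(seconds=row.TCOUNT)
    stamps.append(tmp_time.strftime("%s"))  # "%s" is platform-dependent, as in the question
df['STAMPME'] = stamps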
I'd rewrite your code like this:
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
INIT_TIME = pd.to_datetime(INIT_TIME)

df = pd.read_csv(
    dataset_path, delimiter=',', skiprows=range(0, 1),
    names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME']
)
df = df.drop(['MULT', 'EMPTY', 'TSTAMP'], axis=1)

df['STAMPME'] = pd.to_timedelta(df['TCOUNT'], unit='s') + INIT_TIME
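Note that this stores real datetime64 values rather than the "%s" strings from the question. If you still need integer unix timestamps, one way to get them (an assumption on my part, not part of the original answer) is:
# Convert datetime64[ns] values to unix seconds
df['STAMPME'] = df['STAMPME'].astype('int64') // 10**9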
