Python Pandas: .apply taking forever? - python

I have a DataFrame 'clicks' created by parsing a 1.4 GB CSV file. I'm trying to create a new column 'bought' using the apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking if 'buys' dataframe has values I want, and if so, return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output

There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
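where col_to_join is the column from buys which contains the values you want to join together into a string. If that column does not already hold strings, convert it first (for example with astype(str)), because str.join only accepts strings.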
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply with a Python function, map with a Series does an index lookup for the whole column at once and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
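The two steps above can be tried end to end on a small example. This is a minimal sketch, not the asker's real data: the 'item' column name and the sample values are assumptions, and the IDs are cast to strings before joining because the original function called str() on them.

```python
import pandas as pd

# Hypothetical sample data; the 'item' column name is an assumption.
buys = pd.DataFrame({"session": [1, 1, 2], "item": [10, 11, 20]})
clicks = pd.DataFrame({"session": [1, 2, 3, 1]})

# One pass over buys: join the item IDs of each session into one string.
bought_sessions = (
    buys["item"].astype(str).groupby(buys["session"]).agg(",".join)
)

# Index-based lookup for every click. Sessions without any buys come back
# as NaN, so fill them with '' to mirror the original function.
clicks["bought"] = clicks["session"].map(bought_sessions).fillna("")
```

Unlike the original function, this does not leave a trailing comma after the last ID; append one afterwards if downstream code relies on it.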

Related

Add Function to new Column by Row Value using Vectorization

I have a Pandas DataFrame with many columns and I need to take the part number column and use the data in it to populate the features column. My function add_data takes the part number and looks it up in a SQL database and returns the feature notes.
I have this working with df.apply, and it works well (code block 1).
Because I recently started learning about vectorization, I was wondering if there is a way to do this without df.apply (code block 2).
# Working code (code block 1)
def add_data(row):
    featurenotes = sql_lookup(row)
    return featurenotes
df['Features'] = df.apply(lambda row: add_data(row['partnumber']), axis=1)
This line of code calls my add_data function but I don't know how to grab just the value of the row in the part number column to use in my function.
df['Features'] = add_data(df['partnumber'])
I am very new to Python, so I am trying to learn best practices and how to manipulate pandas DataFrames.
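A per-value database lookup cannot be vectorised the way arithmetic can, since every call is still a separate Python function call. One way to reduce the work, sketched here under the assumption that sql_lookup depends only on the part number, is to look up each distinct part number once and map the results back onto the column. The sql_lookup stub and the sample data are hypothetical stand-ins for the real query and DataFrame.

```python
import pandas as pd

def sql_lookup(partnumber):
    # Hypothetical stand-in for the real database query.
    return f"notes for {partnumber}"

df = pd.DataFrame({"partnumber": ["A1", "B2", "A1"]})

# Query each distinct part number once, then broadcast the results
# back to every row with map.
lookup = {pn: sql_lookup(pn) for pn in df["partnumber"].unique()}
df["Features"] = df["partnumber"].map(lookup)
```

With many repeated part numbers this makes far fewer database calls than calling the function once per row.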

Pandas, accessing every nth element in nested array

I have a dataframe of many rows and 4 columns. Each column contains an array of 100 values.
My intuitive way of doing this is the same way I would do it with multi-dimensional numpy arrays.
For example, I want the first element of every array in column1. So I say
df["column1"][:][0]
To me this makes sense: first select the column, then take every array, then take the first element of every array.
However, it just doesn't work at all. Instead, it simply spits out the entire array from column1, row 1.
But - and this is the most frustrating thing - if I say:
df["column1"][1][0]
It gives me EXACTLY what I expect based on my expected logic, as in, I get the first element in the array in the second row in column1.
How can I get every nth element in every array in column1?
The reason that df["column1"][:][0] isn't doing what you expect is that df["column1"][:] returns the whole Series unchanged. Bracket indexing on a Series then returns the item at that index label, which here is the entire array stored in the first row.
If you want a Series where each item is the element at that position in the corresponding array, the correct solution, whether it seems intuitive or not, is to use .str[...] on the Series.
Instead of
df["column1"][:][0]
use this:
df["column1"].str[0]
It might seem like .str should only be used for actual str values, but a neat trick is that it works for lists too.
Here are some ways to do this:
[item[0] for item in df['column1']] # will result in a list
or
df['column1'].apply(lambda item: item[0]) # will result in a series
Not sure if you're looking for a way that's similar to slicing, but AFAIU pandas sees the lists in your table as just arbitrary objects, not something pandas provides sugar for.
Of course, you can do other fancy things by creating a data frame out of your column:
pd.DataFrame(df['column1'].tolist())
And then do whatever you want with it.
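The approaches from both answers can be compared on a small frame. This is a minimal sketch with made-up data that uses Python lists as the cell values; the step of 2 is an arbitrary choice for "every nth element".

```python
import pandas as pd

# Made-up data: each cell of column1 holds a list.
df = pd.DataFrame({"column1": [[1, 2, 3, 4], [5, 6, 7, 8]]})

# Element 0 of every list, returned as a Series.
first = df["column1"].str[0]

# Every 2nd element of every list; each item of the result is a list.
every_2nd = df["column1"].str[::2]

# Alternative: expand the lists into real columns, then slice columns.
wide = pd.DataFrame(df["column1"].tolist(), index=df.index)
every_2nd_wide = wide.iloc[:, ::2]
```

Expanding into a wide DataFrame costs extra memory, but afterwards ordinary column slicing and vectorised arithmetic are available.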

Does Dask guarantee that rows inside partition (with a non-unique index) will never be reordered?

My application needs to read a dataset into Dask, spread across multiple partitions. With that dataframe, I need to do multiple operations (e.g. subtracting one column from another or finding the ratio of two columns). The index for the dataframe is a non-unique column.
Because the application is entirely metadata driven, the order of the function calls is not known until runtime, so I have designed the application to rely on returning a new delayed dataframe at each stage. I wondered if some clever use of partitioning and column-wise concatenation could help me make this code efficient.
Given that these steps are independent of each other, in the specific example below can I trust the last operation to give the proper result for my row-wise ratio? i.e. If I carry out operations that only add new columns to dataframes, can I trust that the ordering of the rows will never change?
from copy import copy
def subtract(df1, df2, col1, col2):
    df_mod = copy(df1)
    df_mod[f"{col1}-{col2}"] = df1[col1] - df2[col2]
    return df_mod
def ratio(df1, df2, col1, col2):
    df_mod = copy(df1)
    # Rely on the row ordering being unchanged
    df_mod[f"{col1}/{col2}"] = df1[col1] / df2[col2]
    return df_mod
df = load_function_returns_dask_df()
first = subtract(df, df, "a", "b")
second = subtract(df, df, "c", "d")
last = ratio(first, second, "a-b", "c-d")
I understand that I could operate directly on the dataframe to create a new column, but this does not work in the general case for arbitrary operations.
Intuitively it makes sense to me that this operation should work, since each partition is just a pandas dataframe, and it makes no sense for pandas to reorder the rows in a dataframe arbitrarily, but I was hoping for some way of verifying this more formally.
Correct, Dask will not reorder the rows within your partitions so long as you are doing Pandas operations which themselves do not ordinarily reorder rows (sorting, obviously, does), which will be true for any row-wise computation.
Indeed the order of the partitions themselves is preserved as the data passes through operation after operation.
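Within one partition the operations in the question are plain pandas, so that part can be checked on a small pandas DataFrame. The sketch below is not Dask code and the data is made up. It only illustrates that adding columns leaves a non-unique index untouched, and that arithmetic between two Series with identical indexes pairs the rows positionally.

```python
import pandas as pd

# Made-up data with a non-unique index, standing in for one partition.
df = pd.DataFrame(
    {"a": [4.0, 9.0, 16.0], "b": [1.0, 2.0, 3.0],
     "c": [5.0, 6.0, 7.0], "d": [2.0, 3.0, 5.0]},
    index=["x", "x", "y"],
)

first = df.copy()
first["a-b"] = df["a"] - df["b"]
second = df.copy()
second["c-d"] = df["c"] - df["d"]

# Both frames still carry the original index in the original order.
same_order = first.index.equals(df.index) and second.index.equals(df.index)

# The indexes are identical, so the division is done row by row.
last = first.copy()
last["a-b/c-d"] = first["a-b"] / second["c-d"]
```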

Returning a dataframe in Dask

Aim: To speed up applying a function row-wise across a large data frame (~1.9 million rows).
Attempt: Using dask map_partitions where partitions == number of cores. I've written a function which is applied to each row, creates a dict containing a variable number of new values (between 1 and 55). This function works fine standalone.
Problem: I need a way to combine the output of each function into a final dataframe. I tried using df.append, where I'd append each dict to a new dataframe and return this dataframe. If I understand the Dask Docs, Dask should then combine them to one big DF. Unfortunately this line is tripping an error (ValueError: could not broadcast input array from shape (56) into shape (1)). Which leads me to believe it's something to do with the combine feature in Dask?
# Function to be applied row-wise down the dataframe. Takes a column (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return New_DF
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
I am not quite sure I completely understand your code in the absence of an MCVE, but I think there is a bit of a misunderstanding here.
In this piece of code you take a row and a DataFrame and append one row to that DataFrame.
# Function to be applied row-wise down the dataframe. Takes a column (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return New_DF
Instead of appending to New_DF, I would recommend just returning a pd.Series which df.apply concatenates into a DataFrame. That is because if you are appending to the same New_DF object in all nCores partitions, you are bound to run into trouble.
# Function to be applied row-wise down the dataframe. Takes a row and returns a row.
def tobsecret_func(row):
    post = str(row.post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    length_adjusted_series = pd.Series(scores).reindex(range(55))
    return length_adjusted_series
Your error also suggests that as you wrote in your question, your function creates a variable number of values. If the pd.Series you return doesn't have the same shape and column names, then df.apply will fail to concatenate them into a pd.DataFrame. Therefore make sure you return a pd.Series of equal shape each time. This question shows you how to create pd.Series of equal length and index: Pandas: pad series on top or bottom
I don't know what kind of dict your OtherFUNC.countWords returns exactly, so you may want to adjust the line:
length_adjusted_series = pd.Series(scores).reindex(range(55))
As is, the line returns a Series with the index 0, 1, 2, ..., 54. This only lines up with your data if the dict's keys are those integers: keys outside that range are dropped, and any label missing from the dict gets NaN, so a dict with fewer than 55 keys is padded with NaN values.
This means that after the function is applied to a DataFrame, the columns of the result would be named 0, 1, 2, ..., 54.
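If the dict returned by countWords is keyed by names rather than by the integers 0 to 54, reindexing with range(55) would discard every value. A minimal sketch of the adjustment, assuming the full set of possible keys is known up front: ALL_KEYS, to_fixed_row and the sample dicts are hypothetical and only stand in for the real output of OtherFUNC.countWords.

```python
import pandas as pd

# Hypothetical fixed key list; in practice this would name every key
# that countWords can emit, plus 'post'.
ALL_KEYS = ["happy", "sad", "angry", "post"]

def to_fixed_row(scores):
    # Reindex against the full key list so every row has the same labels;
    # keys missing from this particular dict become NaN.
    return pd.Series(scores).reindex(ALL_KEYS)

# Made-up dicts with a variable number of keys.
rows = [{"happy": 2, "post": "a"}, {"sad": 1, "angry": 3, "post": "b"}]
result = pd.DataFrame([to_fixed_row(r) for r in rows])

# For contrast: integer labels do not match string keys, so all is NaN.
all_missing = pd.Series(rows[0]).reindex(range(55)).isna().all()
```

Because every returned Series has the same index, the rows stack into a DataFrame whose columns are exactly ALL_KEYS.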
Now you take your dataset and map your function to each partition and in each partition you apply it to the DataFrame using apply.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
map_partitions expects a function which takes as input a DataFrame and outputs a DataFrame. Your function is doing this by using a lambda function that basically calls your other function and applies it to a DataFrame, which in turn returns a DataFrame. This works, but I highly recommend writing a named function which takes as input a DataFrame and outputs a DataFrame, because it makes your code easier to debug.
For example with a simple wrapper function like this:
def df_wise(df):
    return df.apply(tobsecret_func, axis=1)
Especially as your code gets more complex, abstaining from using lambda functions that call non-trivial code like your custom func and instead making a simple named function can help you debug because the traceback will not just lead you to a line with a bunch of lambda functions like in your code but will also directly point to the named function df_wise, so you will see exactly where the error is coming from.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(df_wise,
                   meta=df_wise(dataset.head())
                   ).\
    compute(get=get)
Notice that we just fed dataset.head() to df_wise to create our meta keyword, which is similar to what Dask would do under the hood.
You are using dask.get, the synchronous scheduler which is why the whole New_DF.append(...) code could work, since you append to the DataFrame for each consecutive partition.
This does not give you any parallelism and thus will not work if you use one of the other schedulers, all of which parallelise your code.
The documentation also mentions the meta keyword argument, which you should supply to your map_partitions call, so dask knows what columns your DataFrame will have. If you don't do this, dask will first have to do a trial run of your function on one of the partitions and check what the shape of the output is before it can go ahead and do the other partitions. This can slow down your code by a ton if your partitions are large; giving the meta keyword bypasses this unnecessary computation for dask.

vectorizing nested for loops - pandas

I have a case where multiple attributes from an 'outside' for-loop are compared to multiple attributes from an 'inside' for loop.
Both loops are on pandas dataframes, and from a little reading, using iterrows() for this sort of a job is generally going to be slow.
Below is an indication of how / why this nested for loop is being used. It is very slow.
for key1, values1 in dataframe_1.iterrows():
    for key2, values2 in dataframe_2.iterrows():
        if values2['a'] > values1['a'] and values2['b'] == values1['b']:
            # do something, such as append to a combined df
            pass
Is there a more suitable way to perform these sorts of nested comparisons on pandas dataframes? Is a different datatype (i.e. a dictionary) a better place to start?
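You don't need nested for loops or iterrows() for this in pandas:
for i in ((d2['a'] > d1['a']) & (d2['b'] == d1['b'])):
    # do something
    print(i)
Depending on which values you want to work with, you can alter the expression:
(d2['a'] > d1['a']) & (d2['b'] == d1['b'])
to get the data needed for the operation. Note that this compares d1 and d2 row by row, so both must have identical indexes; it does not compare every row of d1 with every row of d2 as the nested loops do.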
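To reproduce what the nested loops in the question do, comparing every row of one frame with every row of the other, a different technique from the one in the answer is a merge on the equality condition followed by a filter on the inequality. This is a minimal sketch with made-up data; the suffixes and the name combined are illustrative choices.

```python
import pandas as pd

# Made-up frames with the columns 'a' and 'b' from the question.
dataframe_1 = pd.DataFrame({"a": [1, 5], "b": ["x", "y"]})
dataframe_2 = pd.DataFrame({"a": [3, 4, 9], "b": ["x", "y", "y"]})

# Pair every row of dataframe_1 with the rows of dataframe_2 sharing 'b'.
pairs = dataframe_1.merge(dataframe_2, on="b", suffixes=("_1", "_2"))

# Keep only the pairs where dataframe_2's 'a' exceeds dataframe_1's 'a'.
combined = pairs[pairs["a_2"] > pairs["a_1"]].reset_index(drop=True)
```

The merge handles the equality test on 'b' and the boolean mask handles the inequality on 'a', so no Python-level loop over rows is needed. The intermediate pairs frame can become large when many rows share the same 'b' value.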
