vectorizing nested for loops - pandas - python

I have a case where multiple attributes from an 'outside' for loop are compared to multiple attributes from an 'inside' for loop.
Both loops are over pandas dataframes, and from a little reading, using iterrows() for this sort of job is generally going to be slow.
Below is an indication of how and why this nested for loop is being used. It is very slow.
for key1, values1 in dataframe_1.iterrows():
    for key2, values2 in dataframe_2.iterrows():
        if values2['a'] > values1['a'] and values2['b'] == values1['b']:
            pass  # do something, such as append to a combined df
Is there a more suitable way to perform these sorts of nested comparisons on pandas dataframes? Is a different datatype (e.g. a dictionary) a better place to start?

You don't need a for loop or iterrows() at all in pandas:
for i in ((d2['a'] > d1['a']) & (d2['b'] == d1['b'])):
    # do something
    print(i)
Depending on which values you want to do something with, you can use the boolean mask
(d2['a'] > d1['a']) & (d2['b'] == d1['b'])
to select the data needed for your operation.
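If you need all matching row pairs (which is what the nested loop effectively searches for), another common pattern is to merge on the equality condition and then filter on the inequality. This is a minimal sketch, not taken from the answer above, assuming both frames have columns 'a' and 'b' as in the question:
import pandas as pd

# Toy frames standing in for dataframe_1 and dataframe_2 from the question.
dataframe_1 = pd.DataFrame({"a": [1, 5], "b": ["x", "y"]})
dataframe_2 = pd.DataFrame({"a": [3, 4], "b": ["x", "y"]})

# Merge on the equality condition, then filter on the inequality.
pairs = dataframe_1.merge(dataframe_2, on="b", suffixes=("_1", "_2"))
combined = pairs[pairs["a_2"] > pairs["a_1"]]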

Related

How do I perform a complex SQL-Where operation affecting two tables in pandas?

If I have two tables, I can easily combine them in SQL using something like
SELECT a.*, b.* FROM table_1 a, table_2 b
WHERE (a.id < 1000 OR b.id > 700)
AND a.date < b.date
AND (a.name = b.first_name OR a.name = b.last_name)
AND (a.location = b.origin OR b.destination = 'home')
and there could be many more conditions. Note that this is just an example and the set of conditions may be anything.
The two easiest solutions in pandas that support any set of conditions are:
1. Compute a cross product of the tables and then filter one condition at a time (see the sketch after the question).
2. Loop over one DataFrame (apply, itertuples, ...) and filter the second DataFrame in each iteration. Append the filtered DataFrames from each iteration.
In case of huge datasets (at least a few million rows per DataFrame), the first solution is impossible because of the required memory and the second one is considered an anti-pattern (https://stackoverflow.com/a/55557758/2959697). Either solution will be rather slow.
What is the idiomatic pandas way to proceed in this general case?
Note that I am not only interested in a solution to this particular problem but in the general concept of how to translate these types of statements. Can I use pandas.eval? Is it possible to perform a "conditional merge"? Etc.
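For reference, this is a minimal sketch of the first approach listed above (cross product, then filter one condition at a time), using small hypothetical tables with the columns from the SQL example; it reproduces the WHERE clause but, as noted, is not memory-friendly for millions of rows:
import pandas as pd

# Tiny hypothetical tables with the columns used in the SQL example.
table_1 = pd.DataFrame({"id": [1, 2], "date": ["2020-01-01", "2020-02-01"],
                        "name": ["ann", "bob"], "location": ["home", "work"]})
table_2 = pd.DataFrame({"id": [800, 900], "date": ["2020-03-01", "2020-03-02"],
                        "first_name": ["ann", "carl"], "last_name": ["lee", "bob"],
                        "origin": ["home", "work"], "destination": ["home", "away"]})

# Cross product (pandas >= 1.2), then filter one condition at a time.
merged = table_1.merge(table_2, how="cross", suffixes=("_a", "_b"))
mask = (
    ((merged["id_a"] < 1000) | (merged["id_b"] > 700))
    & (merged["date_a"] < merged["date_b"])
    & ((merged["name"] == merged["first_name"]) | (merged["name"] == merged["last_name"]))
    & ((merged["location"] == merged["origin"]) | (merged["destination"] == "home"))
)
result = merged[mask]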

Iterate through list of dataframes, performing calculations on certain columns of each dataframe, resulting in new dataframe of the results

Newbie here. Just as the title says, I have a list of dataframes (each dataframe is a class of students). All dataframes have the same columns. I have made a global list of certain columns,
BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']
for example. These are yes/no or male/female categories, and I have already changed all of the data to be 1's and 0's for these columns. There are several other columns which I want to ignore as I iterate.
I am trying to accept the list of classes (dataframes) into my function and perform calculations on each dataframe using only my BINARY_CATEGORIES list of columns. This is what I've got, but it isn't making it through all of the classes and/or all of the columns.
def bal_bin_cols(classes):
    i = 0
    c = 0
    for x in classes:
        total_binary = classes[c][BINARY_CATEGORIES[i]].sum()
        print(total_binary)
        i += 1
        c += 1
Eventually I need a new dataframe from this, with all of the sums corresponding to the categories and the respective classes. print(total_binary) is just a placeholder/debugger. I don't have the code yet that will populate the dataframe from the results of the above, but I'd like it to have the classes as the index and the total calculations as the columns.
I know there's probably a vectorized way to do this, or enumerate, or groupby, but I will take a fix to my loop. I've been stuck forever. Please help.
Try something like the following.
First, create a dictionary:
d = {
    'male': 1,
    'female': 0,
    'yes': 1,
    'no': 0
}
Then use replace():
df[BINARY_CATEGORIES] = df[BINARY_CATEGORIES].replace(list(d.keys()), list(d.values()), regex=True)
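Once the binary columns are numeric, the per-class totals the question asks for can be collected without manual counters. A minimal sketch, assuming classes is the question's list of DataFrames and BINARY_CATEGORIES is the global list above (the toy data and the class labels used for the index are hypothetical):
import pandas as pd

BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']

# Toy list of "class" DataFrames with already-binarised columns.
classes = [
    pd.DataFrame({'Gender': [1, 0], 'SPED': [0, 0], '504': [1, 1], 'LAP': [0, 1]}),
    pd.DataFrame({'Gender': [1, 1], 'SPED': [1, 0], '504': [0, 0], 'LAP': [0, 0]}),
]

# One row of column sums per class: classes become the index, categories the columns.
totals = pd.DataFrame([df[BINARY_CATEGORIES].sum() for df in classes])
totals.index = [f"class_{i}" for i in range(len(totals))]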

speed up loop - Assigning Values to Dataframe

I have a function that is running a little too slowly for my liking and I cannot seem to make it faster. I have a combination of 57 products and 402 stores. The function below creates a dataframe with products as the index and stores as the columns. The objective is to fetch the max quantity sold by product and store and assign it to the "unconstraintload_df" dataframe. It seems to be doing the job, but it takes an awfully long time to complete. Does anyone have any ideas to speed it up, please?
def getmaxsaleperproduct_and_store(product, store):
    return training_DS[(training_DS["Prod Code"] == product) & (training_DS["Store"] == store)]["Sold Qty"].max()

def unconstraintsales():
    global unconstraintload_df
    ProdCodeList = training_DS["Prod Code"].unique()
    StoreNumberList = training_DS["Store"].unique()
    unconstraintload_df = pd.DataFrame(index=StoreNumberList, columns=ProdCodeList)
    for store in StoreNumberList:
        for prod in ProdCodeList:
            unconstraintload_df.loc[unconstraintload_df.index == store, prod] = getmaxsaleperproduct_and_store(prod, store)
Consider pivot_table and avoid nested loops. Remember that aggregations in Pandas rarely require looping, unlike general-purpose Python with lists, tuples, or dictionaries:
unconstraintload_df = pd.pivot_table(training_DS, index="Prod Code", columns="Store",
                                     values="Sold Qty", aggfunc="max")
Additionally, outside of reporting, wide datasets tend to be less useful than long format. Consider a long-form aggregation with groupby and avoid having 400+ columns to manage:
long_agg_df = training_DS.groupby(["Prod Code", "Store"])["Sold Qty"].max()
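If the wide product-by-store layout is still needed for reporting, the long result above can be reshaped without another loop; a small follow-up sketch:
# Pivot the long result back to wide form ("Prod Code" as index, "Store" as columns).
wide_df = long_agg_df.unstack("Store")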
Try:
unconstraintload_df = training_DS[["Store", "Prod Code", "Sold Qty"]].groupby(["Store", "Prod Code"]).max().reset_index()

Correct way of testing Pandas dataframe values and modifying them

I need to modify some values of a Pandas dataframe based on a test and leave the other values intact. I also need to leave the order of the rows intact.
I have working code based on iterating over the dataframe's rows, but it's horrendously slow. Is there a quicker way to get it done?
Here are two examples of this very slow code:
for index, row in df.iterrows():
    if df.number[index].is_integer():
        df.number[index] = int(df.number[index])

for index, row in df.iterrows():
    if df.string[index] == "XXX":
        df.string[index] = df.other_colum[index].split("\")[0] + df.other_colum[index].split("\")[1]
    else:
        df.string[index] = df.other_colum[index].split("\")[1] + df.other_colum[index].split("\")[0]
Thanks
Generally you want to avoid iterating through rows in a pandas dataframe, as it is slower than the other methods pandas provides for accomplishing the same thing. One way of getting around this is to use apply. You would redefine the number column:
df["number"] = df["number"].apply(lambda x: int(x) if x.is_integer() else x)
And (re)define the string column:
df["string"] = df["other column"].apply(lambda x: x.split("\\")[0] + x.split("\\")[1] if x == r"XX\X" else x.split("\\")[1] + x.split("\\")[0])
I made some assumptions based on the data you removed from the problem setup: .split("\") is invalid syntax, and "other column" above necessarily has to contain a backslash for your code (and mine) to work; otherwise .split("\\")[1] will raise an IndexError.
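A fully vectorised variant of the string branch is also possible with numpy.where and the pandas string accessor. This is a minimal sketch, keeping the question's test on df["string"] and the question's column name other_colum; the toy data is hypothetical:
import numpy as np
import pandas as pd

# Toy frame standing in for the question's df; other_colum must contain a backslash.
df = pd.DataFrame({"string": ["XXX", "keep"],
                   "other_colum": ["left\\right", "foo\\bar"]})

# Split once on a literal backslash, then pick the piece order based on the test.
parts = df["other_colum"].str.split("\\")
df["string"] = np.where(
    df["string"] == "XXX",
    parts.str[0] + parts.str[1],
    parts.str[1] + parts.str[0],
)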

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing a 1.4 GB CSV. I'm trying to create a new column 'bought' using the apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking whether the 'buys' dataframe has the values I want, and if so, returning a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are some ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed, and it is pretty well optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
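One practical detail worth noting: sessions in clicks with no matching rows in buys come back from map as NaN. If the empty-string behaviour of the original function is wanted, fill those afterwards:
clicks['bought'] = clicks['session'].map(boughtSessions).fillna('')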
