Fastest way to apply function involving multiple dataframe - python

I'm trying to improve my code and haven't found any pointers for my problem.
I have 2 dataframes (let's say A and B) with the same number of rows and columns.
I want to create a third dataframe C, in which each element A[x,y] is transformed based on the element B[x,y].
Currently I perform the operation with 2 loops, one for the x dimension and one for the y dimension:
import pandas

A = pandas.DataFrame([["dataA1", "dataA2", "dataA3"], ["dataA4", "dataA5", "dataA6"]])
B = pandas.DataFrame([["dataB1", "dataB2", "dataB3"], ["dataB4", "dataB5", "dataB6"]])
Result = pandas.DataFrame([["", "", ""], ["", "", ""]])

def mycomplexfunction(x, y):
    return str.upper(x) + str.lower(y)

# Visit every cell of A and B and fill the matching cell of Result
for indexLine in range(0, 2):
    for indexColumn in range(0, 3):
        Result.loc[indexLine, indexColumn] = mycomplexfunction(A.loc[indexLine, indexColumn], B.loc[indexLine, indexColumn])
print(Result)
              0             1             2
0  DATAA1datab1  DATAA2datab2  DATAA3datab3
1  DATAA4datab4  DATAA5datab5  DATAA6datab6
but I'm looking for a more elegant and faster way to do it directly with dataframe functions.
Any ideas?
Thanks,
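One possible way to drop the two loops, sketched under the assumption that both frames share the same shape and labels. The names result_str, result_vec and vectorized are made up for illustration. The first option only works because this particular function can be expressed with pandas string methods; the second wraps an arbitrary Python function with np.vectorize, which is convenient but still calls the function once per cell.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([["dataA1", "dataA2", "dataA3"], ["dataA4", "dataA5", "dataA6"]])
B = pd.DataFrame([["dataB1", "dataB2", "dataB3"], ["dataB4", "dataB5", "dataB6"]])


def mycomplexfunction(x, y):
    return str.upper(x) + str.lower(y)


# Option 1: express the function with column-wise string methods.
# The "+" between two dataframes concatenates the strings cell by cell.
result_str = A.apply(lambda col: col.str.upper()) + B.apply(lambda col: col.str.lower())

# Option 2: keep the Python function and broadcast it over both arrays.
# otypes=[object] keeps the outputs as plain Python strings.
vectorized = np.vectorize(mycomplexfunction, otypes=[object])
result_vec = pd.DataFrame(
    vectorized(A.to_numpy(), B.to_numpy()),
    index=A.index,
    columns=A.columns,
)

print(result_str)
```

If the real function cannot be written with built-in pandas or NumPy operations, option 2 mostly buys readability rather than speed.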

Related

Pandas Apply/Lambda returning dataframe and not single row

New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows and a column called 'Distance', and I want to calculate a new column (TotalCost) with apply and a lambda function that I have created. A snippet of the function is below:
def TotalCost(Distance, m, c):
    return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
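Your lambda in apply should process only one row at a time, so the function must use its Distance argument instead of the whole df.Distance column. Also, apply returns only the calculated values, not the whole dataframe, so assign the result to a new column:
def TotalCost(Distance, m, c):
    return m * Distance + c
df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)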
apply passes one row at a time to your lambda and collects the returned values into a new object: a Series when the function returns a scalar, a DataFrame when it returns a Series.
It does not alter the original dataframe, so you have to assign the result yourself.
Have a look at this link; it should help you gain more insight:
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd

def star(x, m, c):
    return x * m + c

vals = [(1, 2, 4),
        (3, 4, 5),
        (5, 6, 6)]
df = pd.DataFrame(vals, columns=('one', 'two', 'three'))
# axis=0 passes one column at a time; args supplies m=2 and c=3
res = df.apply(star, axis=0, args=(2, 3))
Initial DataFrame
   one  two  three
0    1    2      4
1    3    4      5
2    5    6      6
After applying the function you should get this stored in res
   one  two  three
0    5    7     11
1    9   11     13
2   13   15     15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = @m * Distance + @c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))
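For a formula this simple, plain column arithmetic avoids apply altogether. A minimal sketch using the example input from the question (Distance 1, 2, 3 with m and c both 10); the column name TotalCostApply is made up here only to compare the two approaches.

```python
import pandas as pd

df = pd.DataFrame({"Distance": [1, 2, 3]})
m, c = 10, 10


def TotalCost(Distance, m, c):
    # Works on a single number and on a whole column alike
    return m * Distance + c


# Vectorized: the arithmetic is applied to the whole column at once
df["TotalCost"] = TotalCost(df["Distance"], m, c)

# Row-wise equivalent with apply, shown only for comparison
df["TotalCostApply"] = df.apply(lambda row: TotalCost(row["Distance"], m, c), axis=1)

print(df)
```

Both columns hold the same values; the vectorized form is usually the faster of the two on large frames.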

Looping through a Pandas dataframe as a two dimensional array

At some point in a running process I have two dataframes containing standard deviation values from various sampling distributions.
dfbest keeps the smallest deviations as the best; dftmp records the current values.
So, the first time, dftmp looks like this:
           0         1          2         3         4
0  22.552408  7.299163  15.114379  5.214829  9.124144
with dftmp.shape (1, 5)
Ignoring the Pythonic constructs for a moment and treating the dataframes as 2D arrays, like a spreadsheet in Excel VBA, I write...
A:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[R][C] > dftmp[R][C]:
                dfbest[R][C] = dftmp[R][C]
B:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[C][R] > dftmp[C][R]:
                dfbest[C][R] = dftmp[C][R]
Code A fails while B works. I'd expect the opposite, but I'm new to Python, so who knows what I'm not seeing here. Any suggestions? I suspect there is a more appropriate .iloc solution to this.
When you access a dataframe like that (dfbest[x][y]), x is the column label and y is the row label. That's why code B works.
Here is more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics
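A small sketch of the indexing order and of a loop-free element-wise minimum. The dftmp values below are invented for illustration, and both frames are assumed to have identical shape and labels. DataFrame.where keeps a value where the condition holds and takes the replacement elsewhere.

```python
import pandas as pd

dfbest = pd.DataFrame([[22.552408, 7.299163, 15.114379, 5.214829, 9.124144]])
dftmp = pd.DataFrame([[20.0, 8.0, 15.0, 6.0, 9.0]])  # hypothetical new values

# df[col][row] selects the column first, then the row;
# .iloc[row, col] uses the spreadsheet-like (row, column) order.
assert dfbest[1][0] == dfbest.iloc[0, 1]

# Keep the current best where it is already smaller or equal,
# otherwise take the value from dftmp.
best = dfbest.where(dfbest <= dftmp, dftmp)

print(best)
```

This replaces both nested loops with one expression and avoids the chained assignment dfbest[C][R] = ..., which pandas may apply to a temporary copy instead of the original frame.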

Converting for-loop to windowed function

We have a large (spark) dataframe, and we need to compute a new column. Each row is calculated from the value in the previous row in the same column (the first row in the new column is simply 1). This is trivial in a for-loop, but due to the extremely large number of rows we wish to do this in a window function. Because the input to the current row is the previous row, it is not obvious to us how we can achieve this, if it's possible.
We have a large dataframe with a column containing one of three values: A, B and C. Each of these 3 options represents a formula to compute a new value in a new column in the same row.
If it is A, then the new value is 1.
If it is B, then the new value is the same as the previous row.
If it is C, then the new value is the same as the previous row + 1.
For example:
A
B
B
C
B
A
C
B
C
A
Should become:
A 1
B 1
B 1
C 2
B 2
A 1
C 2
B 2
C 3
A 1
We can achieve this behavior as follows using a for loop (pseudocode):
for index in range(len(my_df)):
    if index == 0:
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'A':
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'B':
        my_df[new_column][index] = my_df[new_column][index-1]
    elif my_df[letter_column][index] == 'C':
        my_df[new_column][index] = my_df[new_column][index-1] + 1
We wish to replace the for loop with a window function. We tried using the 'lag' keyword, but the previous row's value depends on previous calculations. Is there a way to do this or is it fundamentally impossible to do this with a window (or map) function? And if it's impossible, is there an alternative that would be faster than the for loop? (A reduce function would have similar performance?)
Again, due to the extremely large number of rows, this is about performance. We should have enough RAM to hold everything in memory, but we wish the processing to be as quick as possible (and to learn how to solve analogues of this problem more generally: applying window functions that require data calculated in previous rows of that window function). Any help would be much appreciated!!
Kind regards,
Mick
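One way to look at the rules above: every A (and the very first row) starts a new block, and inside a block the value is 1 plus the number of C rows seen so far. That removes the dependency on the previously computed value and leaves two cumulative sums. The sketch below uses pandas rather than Spark so that it is easy to run; the names letters, is_a, is_c, block and new_column are made up for illustration. In Spark the same idea would presumably map onto running sums over an ordered window, which requires an explicit ordering column that the question does not show.

```python
import pandas as pd

letters = pd.Series(list("ABBCBACBCA"))

# A row starts a new block if it is an 'A'; the first row always does.
is_a = letters.eq("A")
is_a.iloc[0] = True

# Running count of block starts gives each row its block id.
block = is_a.cumsum()

# Within a block, every 'C' adds 1; rows that start a block never count.
is_c = letters.eq("C") & ~is_a
new_column = 1 + is_c.astype(int).groupby(block).cumsum()

print(pd.DataFrame({"letter": letters, "new_column": new_column}))
```

On the example from the question this reproduces the expected column without looking at the previous row's computed value.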

Python Pandas df.isin shows inaccurate results

I have a point cloud of 6 million x, y and z points that I need to process. I need to look for specific points within these 6 million xyz points, and I have been using the pandas df.isin() function to do it. I first save the 6 million points into a pandas dataframe (named point_cloud) and the specific points I need to look for into another dataframe (named specific_point). I only have two specific points to look for, so the output of df.isin() should show 2 True values, but it is showing 3 instead.
To prove that 3 True values is wrong, I iterated through the 6 million points looking for the two specific points using iterrows(). The result was indeed 2 True values. So why is df.isin() showing 3 instead of the correct result of 2?
I have tried this, which results in true_count being 3:
label = (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
    if int(t_f.values) == int(1):
        true_count += 1
print(true_count)
I have tried this as well, which also results in true_count being 3:
true_count = 0
for t_f in (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).values:
    if t_f == True:
        true_count += 1
Lastly, I tried the most inefficient way, iterating through the 6 million points using iterrows(), and this gives the correct value for true_count, which is 2:
true_count = 0
for index_sp, sp in specific_point.iterrows():
    for index_pc, pc in point_cloud.iterrows():
        if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z'] == pc['z']:
            true_count += 1
print(true_count)
Does anyone know why df.isin() behaves this way? Or have I overlooked something?
Combining isin on several columns with & does not compare the dataframes row by row; each column is tested independently, so the result behaves like a check against the Cartesian product of the listed values.
So what you can do is
checked=point_cloud.merge(specific_point,on=['x','y','z'],how='inner')
For example, if you have two lists l1=[1,2] and l2=[3,4], isin will match any row equal to [1,3], [1,4], [2,3] or [2,4].
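A tiny reproduction of the effect, with invented coordinates: two specific points whose x and y values can be recombined into points that are not in the list. The names mask_isin, matched and mask_rows are made up for illustration, and the row-wise mask assumes specific_point contains no duplicate rows.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are the two points we are looking for
point_cloud = pd.DataFrame(
    {"x": [1, 1, 2, 2, 5], "y": [3, 4, 3, 4, 5], "z": [0, 0, 0, 0, 0]}
)
specific_point = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [0, 0]})

# Column-by-column isin: (1, 4, 0) and (2, 3, 0) also pass, because each
# coordinate on its own appears somewhere in specific_point.
mask_isin = (
    point_cloud["x"].isin(specific_point["x"])
    & point_cloud["y"].isin(specific_point["y"])
    & point_cloud["z"].isin(specific_point["z"])
)

# Inner merge on all three columns compares whole rows.
matched = point_cloud.merge(specific_point, on=["x", "y", "z"], how="inner")

# A left merge with indicator=True yields a row-wise True/False mask.
merged = point_cloud.merge(specific_point, on=["x", "y", "z"], how="left", indicator=True)
mask_rows = merged["_merge"] == "both"

print(mask_isin.sum(), len(matched), mask_rows.sum())
```

Here the column-wise mask over-counts while both merge variants find exactly the two requested points.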

Optimization problem with Pandas apply and multiIndex search [duplicate]

This question already has an answer here:
How do you shift Pandas DataFrame with a multiindex?
(1 answer)
Closed 4 years ago.
So, I was wondering if I am doing this correctly, because maybe there is a much better way to do this and I am wasting a lot of time.
I have a 3 level index dataframe, like this:
IndexA IndexB IndexC ColumnA ColumnB
A B C1 HiA HiB
A B C2 HiA2 HiB2
I need to do a search for every row, saving data from other rows. I know this sounds strange, but it makes sense with my data. For example:
I want to add ColumnB data from my second row to the first one, and vice-versa, like this:
IndexA IndexB IndexC ColumnA ColumnB NewData
A B C1 HiA HiB HiB2
A B C2 HiA2 HiB2 HiB
In order to do this search, I do an apply on my df, like this:
df['NewData'] = df.apply(lambda r: my_function(df, r.IndexA, r.IndexB, r.IndexC), axis=1)
Where my function is:
def my_function(df, indexA, indexB, indexC):
    idx = pd.IndexSlice
    # Here I do calculations (subtraction) to know exactly which C I want
    # newIndexC = indexC - someConstantValue
    try:
        res = df.loc[idx[indexA, indexB, newIndexC], 'ColumnB']
        return res
    except KeyError:
        return -1
I tried to simplify a lot of this problem, sorry if it sounds confusing. Basically my data frame has 20 million rows, and this search takes 2 hours. I know it has to take a lot, because there are a lot of accesses, but I wanted to know if there could be a faster way to do this search.
More information:
On indexA I have different groups of values. Example: Countries.
On indexB I have different groups of dates.
On indexC I have different groups of values.
Answer:
df['NewData'] = df.groupby(level=['IndexA', 'IndexB'])['ColumnB'].shift(7)
All you're really doing is a shift. You can speed it up 1000x like this:
df['NewData'] = df['ColumnB'].shift(-someConstantValue)
You'll need to roll the data from the top someConstantValue number of rows around to the bottom--I'm leaving that as an exercise.
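A minimal sketch of the grouped shift on made-up data. It assumes that IndexC is an integer that is sorted and gap-free within each (IndexA, IndexB) group, and it uses an offset of 1 as a stand-in for someConstantValue; the country codes, date and ColumnB values are invented. Under those assumptions, shifting by one position inside a group is the same as looking up the row with IndexC - 1.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["FR", "DE"], ["2020-01-01"], [1, 2, 3]],
    names=["IndexA", "IndexB", "IndexC"],
)
df = pd.DataFrame({"ColumnB": [10, 20, 30, 40, 50, 60]}, index=index).sort_index()

offset = 1  # hypothetical stand-in for someConstantValue

# Shift ColumnB down by `offset` rows within each (IndexA, IndexB) group,
# so values never leak from one group into the next.
shifted = df.groupby(level=["IndexA", "IndexB"])["ColumnB"].shift(offset)

# Rows without a predecessor get -1, mirroring the KeyError branch above.
df["NewData"] = shifted.fillna(-1)

print(df)
```

If IndexC has gaps, a positional shift no longer matches the label lookup, and a self-merge on the recomputed key may be a safer choice.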
