We have a large Spark dataframe, and we need to compute a new column. Each row's value is calculated from the previous row's value in the same column (the first row of the new column is simply 1). This is trivial in a for loop, but due to the extremely large number of rows we wish to do it with a window function. Because the input to the current row is the previous row, it is not obvious to us how we can achieve this, if it is possible at all.
The dataframe has a column containing one of three values: A, B, or C. Each of these three options represents a rule for computing the value of the new column in the same row.
If it is A, then the new value is 1.
If it is B, then the new value is the same as the previous row.
If it is C, then the new value is the same as the previous row + 1.
For example:
A
B
B
C
B
A
C
B
C
A
Should become:
A 1
B 1
B 1
C 2
B 2
A 1
C 2
B 2
C 3
A 1
We can achieve this behavior as follows using a for loop (pseudocode):
for index in range(len(my_df)):
    if index == 0:
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'A':
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'B':
        my_df[new_column][index] = my_df[new_column][index - 1]
    elif my_df[letter_column][index] == 'C':
        my_df[new_column][index] = my_df[new_column][index - 1] + 1
We wish to replace the for loop with a window function. We tried using the lag function, but the previous row's value depends on previously calculated values. Is there a way to do this, or is it fundamentally impossible with a window (or map) function? And if it is impossible, is there an alternative that would be faster than the for loop? (Would a reduce function have similar performance?)
Again, due to the extremely large number of rows, this is about performance. We should have enough RAM to hold everything in memory, but we want the processing to be as quick as possible (and to learn how to solve analogues of this problem more generally: applying window functions that need values calculated in previous rows of that same window). Any help would be much appreciated!
Kind regards,
Mick
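For these particular rules the recurrence can be unrolled rather than computed row by row: every A resets the value to 1, B keeps it, and C adds 1, so each row's value is simply 1 plus the number of C rows since the most recent A. Below is a minimal PySpark sketch of that idea (specific to these rules, not a general answer to row-by-row recurrences). It assumes an explicit ordering column called idx, since Spark rows have no inherent order, and the letter_column / new_column names from the pseudocode above:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Running count of 'A' rows: every 'A' starts a new group
w_all = Window.orderBy("idx")
my_df = my_df.withColumn(
    "grp",
    F.sum(F.when(F.col("letter_column") == "A", 1).otherwise(0)).over(w_all),
)

# Within each group the value is 1 plus the running count of 'C' rows
w_grp = Window.partitionBy("grp").orderBy("idx")
my_df = my_df.withColumn(
    "new_column",
    F.lit(1) + F.sum(F.when(F.col("letter_column") == "C", 1).otherwise(0)).over(w_grp),
)
Note that the first window has no partitioning, so that step collapses the data onto a single partition. Also, lag only sees columns that already exist, not values being computed in the same pass, which is why the direct lag approach fails.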
Related
Imagine you have a PySpark dataframe df with three columns: A, B, C. I want to take the rows of the dataframe where the value of B does not exist anywhere in column C.
Example:
A B C
a 1 2
b 2 4
c 3 6
d 4 8
would return
A B C
a 1 2
c 3 6
What I tried
df.filter(~df.B.isin(df.C))
I also tried making the values of B into a list, but that takes a significant amount of time.
The problem is how you're using isin. For better or worse, isin can't actually handle another PySpark Column object as an input; it needs an actual collection. So one thing you could do is convert your column to a list:
col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
df.filter(~df.B.isin(col_values))
Performance-wise, though, this is obviously not ideal, as your driver node is now in charge of manipulating the entire contents of the single column you've just loaded into memory. You could use a left anti join instead to get the result you need without having to turn anything into a list or losing the efficiency of Spark's distributed computing:
df0 = df[["C"]].withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()
Thanks to Emma in the comments for her contribution.
I have a DataFrame with thousands of rows. Its structure is as below:
   A  B   C    D
0  q  20  'f'
1  q  14  'd'
2  o  20  'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to add the lower of the two B values to the D column of the row with the greater B value. Then I want to remove the row whose B value was moved; it's like a swap process.
   A  B   C    D
0  q  20  'f'  14
1  o  20  'a'
I have thousands of rows, and the iloc, loc and at methods are slow. At the least I want to use the DataFrame apply method. I tried some code samples, but they didn't work.
I want to do something like the below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass it to the method? I am also open to hearing faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
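Applying that shift idea to the example above, here is a minimal sketch. It assumes, as in the sample data, that the smaller B value always sits in the later of the two rows whose A values match:
import pandas as pd

df = pd.DataFrame({"A": ["q", "q", "o"],
                   "B": [20, 14, 20],
                   "C": ["f", "d", "a"]})

# True for rows whose A equals the previous row's A
same_as_prev = df["A"].eq(df["A"].shift(1))

# Give the earlier row of each matching pair the next row's (smaller) B value
df["D"] = df["B"].shift(-1).where(same_as_prev.shift(-1, fill_value=False))

# Drop the rows whose B value was moved up
result = df.loc[~same_as_prev].reset_index(drop=True)
print(result)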
At some point in a running process I have two dataframes containing standard deviation values from various sampling distributions.
dfbest keeps the smallest deviations seen so far as the best; dftmp records the current values.
So, the first time, dftmp looks like this:
0 1 2 3 4
0 22.552408 7.299163 15.114379 5.214829 9.124144
with dftmp.shape equal to (1, 5).
Ignoring for a moment the pythonic constructs and treating the dataframes as 2D arrays, like a spreadsheet in Excel VBA, I write...
A:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[R][C] > dftmp[R][C]:
                dfbest[R][C] = dftmp[R][C]
B:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[C][R] > dftmp[C][R]:
                dfbest[C][R] = dftmp[C][R]
Code A fails while B works. I'd expect the opposite, but I'm new to Python, so who knows what I'm not seeing here. Any suggestions? I suspect there is a more appropriate .iloc solution to this.
When you access a dataframe like that (dfbest[x][y]), x selects a column and y then selects a row within it. That's why code B works.
Here is more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics
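As an aside, the "keep the smaller value" update itself does not need nested loops at all. A minimal sketch, assuming the two frames have the same shape and labels as described above:
import numpy as np

if dfbest.empty:
    dfbest = dftmp.copy()
else:
    # Element-wise minimum of the two frames, aligned on row and column labels
    dfbest = np.minimum(dfbest, dftmp)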
I need to calculate a sum in Python based on a huge JSON file. It is important that the calculation is correct. First I load it into pandas, and it gets a structure like the following, but bigger:
A B
a 1
a 2
b 2
Then I want the sum of column B where A is a. To do this I use the pandas query method. The problem is that sometimes it gives the correct answer, 3, and sometimes just 0. I have tried both of the code snippets below, which I believe are equivalent to each other.
my_sum = df[df["A"] == "a"].sum()["B"]
my_sum = df.query("A == 'a'")['B'].sum()
Could it be that the query is run asynchronously?
How can this sum be calculated without getting any inconsistencies?
Clarifications:
my_sum sometimes equals 3, but most often it equals 0.
It is more often 3 when running in the PyCharm debugger.
Column B consists of floats.
I'm trying to improve my code, and I can't find a clue for my problem.
I have two dataframes (let's say A and B) with the same number of rows and columns.
I want to create a third dataframe C, in which each element is computed from the corresponding A[x,y] and B[x,y] elements.
Currently I perform the operation with two loops, one over the x dimension and one over the y dimension:
import pandas

A = pandas.DataFrame([["dataA1", "dataA2", "dataA3"], ["dataA4", "dataA5", "dataA6"]])
B = pandas.DataFrame([["dataB1", "dataB2", "dataB3"], ["dataB4", "dataB5", "dataB6"]])
Result = pandas.DataFrame([["", "", ""], ["", "", ""]])

def mycomplexfunction(x, y):
    return str.upper(x) + str.lower(y)

for indexLine in range(0, 2):
    for indexColumn in range(0, 3):
        Result.loc[indexLine, indexColumn] = mycomplexfunction(
            A.loc[indexLine, indexColumn],
            B.loc[indexLine, indexColumn],
        )

print(Result)
0 1 2
0 DATAA1datab1 DATAA2datab2 DATAA3datab3
1 DATAA4datab4 DATAA5datab5 DATAA6datab6
but I'm looking for a more elegant and faster way to do it directly using dataframe functions.
Any idea?
Thanks,
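One possible shortcut, as a sketch rather than a drop-in replacement for an arbitrary mycomplexfunction: if the function can be rewritten in terms of vectorized Series operations, as the upper/lower example can, DataFrame.combine applies a function to one column of A and the matching column of B at a time, so the per-cell loops disappear:
# combine passes matching columns of A and B to the lambda as Series
Result = A.combine(B, lambda colA, colB: colA.str.upper() + colB.str.lower())
print(Result)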