I presume similar questions exist, but could not locate them. I have Pandas 0.19.2 installed. I have a large dataframe, and for each row value I want to carry over the previous row's value for the same column based on some logical condition.
Below is a brute-force double for loop solution for a small example. What is the most efficient way to implement this? Is it possible to solve this in a vectorised manner?
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.uniform(low=-0.2, high=0.2, size=(10,2) ))
print(df)
for col in df.columns:
    prev = None
    for i, r in df.iterrows():
        if prev is not None:
            if (df.loc[i, col] <= prev*1.5) and (df.loc[i, col] >= prev*0.5):
                df.loc[i, col] = prev  # single .loc call instead of chained indexing
        prev = df.loc[i, col]
print(df)
Output:
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.132356 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
EDIT: Please note one value can be carried over multiple times, so long as it satisfies the logical condition.
prev = df.shift()
replace_mask = (0.5 * prev <= df) & (df <= 1.5 * prev)
df = df.where(~replace_mask, prev)
I came up with this:
keep_going = True
while keep_going:
    df = df.mask((df.diff(1) / df.shift(1)<0.5) & (df.diff(1) / df.shift(1)> -0.5) & (df.diff(1) / df.shift(1)!= 0)).ffill()
    trimming_to_do = ((df.diff(1) / df.shift(1)<0.5) & (df.diff(1) / df.shift(1)> -0.5) & (df.diff(1) / df.shift(1)!= 0)).values.any()
    if not trimming_to_do:
        keep_going = False
which gives the desired result (at least for this case):
print(df)
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.120775 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
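For comparison, the sequential rule from the question (each value is checked against the previous value after that one has possibly been replaced) can also be written as a plain loop over each column's NumPy array, which avoids the per-element `.loc` lookups. This is only a sketch: the name `carry_forward` and the `low`/`high` parameters are made up for illustration, and `Series.to_numpy` needs a newer pandas than the 0.19.2 mentioned in the question.

```python
import numpy as np
import pandas as pd

def carry_forward(df, low=0.5, high=1.5):
    """Return a copy of df in which a value is replaced by the previous
    (already processed) value whenever prev*low <= value <= prev*high."""
    out = df.copy()
    for col in out.columns:
        values = out[col].to_numpy(dtype=float, copy=True)
        for k in range(1, len(values)):
            prev = values[k - 1]          # may itself be a carried-over value
            if prev * low <= values[k] <= prev * high:
                values[k] = prev
        out[col] = values
    return out

demo = pd.DataFrame({"a": [1.0, 1.2, 1.4, 3.0], "b": [2.0, 0.5, 0.6, 0.7]})
result = carry_forward(demo)
print(result)
```

As in the original loop, the condition can never hold for a negative previous value, because then `prev*high` is smaller than `prev*low`.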
Related
I have a simple dataframe and I would like to add the column 'Pow_calkowita'. If 'liczba_kon' is 0, 'Pow_calkowita' is 'Powierzchn', but if 'liczba_kon' is not 0, 'Pow_calkowita' is 'liczba_kon' * 'Powierzchn'. Why I can't do that?
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        row['Pow_calkowita'] = row['Powierzchn']
    elif row['liczba_kon'] != 0:
        row['Pow_calkowita'] = row['Powierzchn'] * row['liczba_kon']
My code didn't return any values.
liczba_kon Powierzchn
0 3 69.60495
1 1 39.27270
2 1 130.41225
3 1 129.29570
4 1 294.94400
5 1 64.79345
6 1 108.75560
7 1 35.12290
8 1 178.23905
9 1 263.00930
10 1 32.02235
11 1 125.41480
12 1 47.05420
13 1 45.97135
14 1 154.87120
15 1 37.17370
16 1 37.80705
17 1 38.78760
18 1 35.50065
19 1 74.68940
I have found a solution:
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
Is it good way?
To write idiomatic Pandas code and benefit from its efficient array processing, avoid looping over the rows yourself. Pandas operations are vectorized over NumPy ndarrays, so the looping happens behind the scenes in optimized, compiled C code. A single vectorized statement therefore replaces the explicit loop and is both shorter and faster.
As your formula is based on a condition, you cannot use direct multiplication. Instead you can use np.where() as follows:
import numpy as np
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0, df['Powierzchn'], df['Powierzchn'] * df['liczba_kon'])
When the test condition in first parameter is true, the value from second parameter is taken, else, the value from the third parameter is taken.
Test run output: (Add 2 more test cases at the end; one with 0 value of liczba_kon)
print(df)
liczba_kon Powierzchn Pow_calkowita
0 3 69.60495 208.81485
1 1 39.27270 39.27270
2 1 130.41225 130.41225
3 1 129.29570 129.29570
4 1 294.94400 294.94400
5 1 64.79345 64.79345
6 1 108.75560 108.75560
7 1 35.12290 35.12290
8 1 178.23905 178.23905
9 1 263.00930 263.00930
10 1 32.02235 32.02235
11 1 125.41480 125.41480
12 1 47.05420 47.05420
13 1 45.97135 45.97135
14 1 154.87120 154.87120
15 1 37.17370 37.17370
16 1 37.80705 37.80705
17 1 38.78760 38.78760
18 1 35.50065 35.50065
19 1 74.68940 74.68940
20 0 69.60495 69.60495
21 2 74.68940 149.37880
To answer the first question: "Why I can't do that?"
The documentation states (in the notes):
Because iterrows returns a Series for each row, ....
and
You should never modify something you are iterating over. [...] the iterator returns a copy and not a view, and writing to it will have no effect.
This basically means that it returns a new Series holding the values of that row.
So what you are getting is NOT the actual row, and definitely NOT the dataframe!
What you are doing does work, just not in the way that you want it to:
df = pd.DataFrame(dict(a=[1, 2, 3], b=list("abc")))
df # To demonstrate what you are doing
a b
0 1 a
1 2 b
2 3 c
for index, row in df.iterrows():
...     print("\n------------------\n>>> Next Row:\n")
...     print(row)
...     row["c"] = "ADDED" ####### HERE I am adding to 'the row'
...     print("\n -- >> added:")
...     print(row)
...     print("----------------------")
...
------------------
Next Row: # as you can see, this Series has the same values
a 1 # as the row that it represents
b a
Name: 0, dtype: object
-- >> added:
a 1
b a
c ADDED # and adding to it works... but you aren't doing anything
Name: 0, dtype: object # with it, unless you append it to a list
----------------------
------------------
Next Row:
a 2
b b
Name: 1, dtype: object
### same here
-- >> added:
a 2
b b
c ADDED
Name: 1, dtype: object
----------------------
------------------
Next Row:
a 3
b c
Name: 2, dtype: object
### and here
-- >> added:
a 3
b c
c ADDED
Name: 2, dtype: object
----------------------
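A shorter way to check the same point is to look at the DataFrame after the loop. The sketch below is illustrative only: the column name `c` and the upper-casing step are invented for the example. It shows that writing to the yielded row leaves the frame untouched, while a vectorized assignment on the frame itself persists.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})

# Writing to the Series yielded by iterrows() only changes that Series.
for index, row in df.iterrows():
    row["c"] = "ADDED"

columns_after_loop = list(df.columns)   # the frame has gained no column

# Assigning a whole column on the DataFrame itself does persist.
df["c"] = df["b"].str.upper()

print(columns_after_loop)
print(df)
```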
To answer the second question: "Is it good way?"
No.
Because a vectorized expression like the one SeaBean has shown actually uses the power of numpy and pandas, whose operations are vectorized.
This is a link to a good article on vectorization in numpy arrays, which are basically the building blocks of pandas DataFrames and Series.
A DataFrame is designed to work with vectorization; you can treat it like a database table. So you should use its built-in functions whenever possible.
tdf = df.copy()  # temp df; a real copy, so df['liczba_kon'] itself is not modified
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1)  # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn']  # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita']  # copy column
This simplifies the code and improves performance. We can compare the two approaches:
import time
import numpy as np
import pandas as pd

sampleSize = 100000
df = pd.DataFrame({
    'liczba_kon': np.random.randint(3, size=(sampleSize)),
    'Powierzchn': np.random.randint(1000, size=(sampleSize)),
})
# vectorization
s = time.time()
tdf = df.copy()  # temp df; a real copy, so the loop below still sees the zeros
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1)  # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn']  # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita']  # copy column
print(time.time() - s)
# iteration
s = time.time()
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
print(time.time() - s)
We can see that vectorization performed much faster (first timing: vectorized, second timing: iteration).
0.0034716129302978516
6.193516492843628
I have a pandas dataframe with example data:
idx price lookback
0 5
1 7 1
2 4 2
3 3 1
4 7 3
5 6 1
Lookback can be positive or negative but I want to take the absolute value of it for how many rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious way, this did not work:
df['sbc'] = df['price'].shift(dataframe['lb'].abs() + 1)
I then tried using a lambda, this did not work but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x+1)]))
df['sbc'] = sbc(df['price'], df['lb'].abs())
I also tried a loop (which was extremely slow, but worked) but I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) < i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what its worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several provided solutions worked great (thank you!) but all needed some small tweaks to deal with my potential for negative numbers and that it was a lookback + 1 not - 1 and so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, as improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
        else:
            return np.nan
df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to which row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan
df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
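The question also asked about negative lookbacks and about deriving the `streak-roc` column in the same step. A minimal sketch of how the array-indexing idea could cover both is shown here. The function name `add_lookback_columns` is hypothetical, and it uses row positions (`np.arange`) rather than index labels, which assumes "rows back" means positional rows.

```python
import numpy as np
import pandas as pd

def add_lookback_columns(df):
    """Add 'lb_price' and 'streak-roc' using positional NumPy indexing."""
    out = df.copy()
    price = out["price"].to_numpy(dtype=float)
    lookback = out["lookback"].abs().to_numpy(dtype=float)

    # Position of the row to read from; NaN where lookback is missing.
    src = np.arange(len(out)) - (lookback + 1)
    valid = src >= 0                  # NaN >= 0 is False, so NaNs drop out

    lb_price = np.full(len(out), np.nan)
    lb_price[valid] = price[src[valid].astype(int)]

    out["lb_price"] = lb_price
    out["streak-roc"] = 100 * (out["price"] - out["lb_price"]) / out["lb_price"]
    return out

# Same data as the question, with one lookback made negative on purpose.
demo = pd.DataFrame({"price": [5, 7, 4, 3, 7, 6],
                     "lookback": [np.nan, 1, -2, 1, 3, 1]})
result = add_lookback_columns(demo)
print(result)
```

Rows whose lookback is missing or reaches before the first row keep NaN in both new columns.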
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, which I store in an array.
The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0<i-lookback is equivalent to 0<=i-(lookback+1)
    # note: `lookback != np.nan` is always True, so np.isnan() is used instead
    if not np.isnan(lookback) and 0 < i-lookback:
        new[i] = price[int(i-(lookback+1))]
    else:
        new[i] = np.nan
df['lb_price'] = new
I am definitely still learning python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A reaches some threshold, for this example let's say 10. So far I am trying to use iterrows() and can segment based on whether A >= 10, but can't seem to solve the summation of rows until the threshold is met. The resultant df must be exhaustive even if the final A values do not meet the conditional threshold - see the final row of the desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable t that I created essentially tracks the cumulative sum to see whether it is >= n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe for any given row (j and u are just there in parallel to do the same thing for column B).
There are a few conditions so some elif statements, and there will be different behavior for the last row the way I have set it up, so I had to have some separate logic for that with the last if -- otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a,b = [],[]
t,u,count = 0,0,0
n=10
for (i,j) in zip(df1['A'], df1['B']):
    count+=1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)
df2 = pd.DataFrame({'A' : a, 'B' : b})
df2
Here's one that works that's shorter:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
rows = []  # collect output rows in a list; DataFrame.append was removed in pandas 2.0
index = 0
while index < df1.size/2:
    if df1.iloc[index]['A'] >= 10:
        # a row that meets the threshold on its own is kept as is
        rows.append([df1.iloc[index]['A'], df1.iloc[index]['B']])
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size/2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        # append the group sums, including a final group that stays below 10
        rows.append([a_sum, b_sum])
df2 = pd.DataFrame(rows, columns=['A','B'])
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In Pandas, use iloc to access each row by index. Make sure you don't go out of the DataFrame by checking the size. df.size returns the number of elements, so it will multiply the rows by the columns. This is why I divided the size by the number of columns, to get the actual number of rows.
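The same bookkeeping can be sketched as a single pass over the underlying NumPy array, keeping one running total per column. The helper name `sum_until_threshold` and its `key`/`threshold` parameters are assumptions made for this illustration, not part of either answer above.

```python
import pandas as pd

def sum_until_threshold(df, key="A", threshold=10):
    """Collapse consecutive rows into groups whose `key` sum reaches threshold.
    Rows left over at the end form a final, smaller group."""
    key_pos = df.columns.get_loc(key)
    groups, running = [], None
    for row in df.to_numpy():
        # start a new group or extend the open one (sums all columns at once)
        running = row.copy() if running is None else running + row
        if running[key_pos] >= threshold:
            groups.append(running)
            running = None
    if running is not None:           # leftover rows that never reached it
        groups.append(running)
    return pd.DataFrame(groups, columns=df.columns)

df1 = pd.DataFrame(data=[[20, 16], [10, 5], [3, 2], [1, 1],
                         [12, 10], [9, 7], [6, 6], [5, 2]], columns=["A", "B"])
df2 = sum_until_threshold(df1)
print(df2)
```

On the question's data this reproduces the desired result, including the final row that stays below the threshold.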
I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, as such:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code
import numpy as np
import pandas as pd
df = pd.DataFrame({
"A": [ 1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA ]
})
for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[ : idx, "A" ])
It iterates on each NaN and fills it with the mean of the previous values, including those filled NaNs.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution (its runtime complexity is O(N^2)) when there are many NaNs, but we also should avoid python iterations, since they are slow when compared to native pandas / numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[ last_idx : idx, "A" ]
    # .loc slices are label-based and include the end label, so we drop idx itself
    prev_values = prev_values[ : -1 ]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script should pass on the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine but it can be optimized with incremental average formulas, from math.stackexchange.com
Here is an adaptation of that other question (not the exact formula, just the concept).
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i # divide by number of items up to previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi
df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage with this code is not having to read all items up to the current index on each iteration to get the mean.
This option still uses a python loop--which is not the best choice with pandas--but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find such method without a loop).
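For illustration, the same running-sum idea can be sketched on a raw NumPy array, which keeps the Python loop but avoids pandas indexing inside it. The function name `fill_with_running_mean` is hypothetical, and leaving leading NaNs unfilled is a design choice of this sketch rather than something taken from the answers.

```python
import numpy as np
import pandas as pd

def fill_with_running_mean(values):
    """Fill each NaN with the mean of everything before it, counting
    earlier fills. Leading NaNs stay NaN because no mean exists yet."""
    out = np.asarray(values, dtype=float).copy()
    total, count = 0.0, 0
    for k in range(len(out)):
        if np.isnan(out[k]):
            if count == 0:
                continue              # nothing seen yet, leave the NaN
            out[k] = total / count    # mean of all earlier (filled) values
        total += out[k]
        count += 1
    return out

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, np.nan, np.nan, 11, np.nan]})
df["A"] = fill_with_running_mean(df["A"])
print(df)
```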
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using timeit module with 3 repetitions on random samples with 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt
def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i # divide by number of items up to previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[ last_idx : idx, "A" ]
        # .loc slices are label-based and include the end label, so we drop idx itself
        prev_values = prev_values[ : -1 ]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df

def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[ : idx, "A" ])
    return df

def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})
r = 3
p = 0.4 # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))
print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )
timings = pd.DataFrame(timings).T
print(timings)
timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')
I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and finding the match should be done with a cartesian distance:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example with an ipython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not real happy with having to loop over the rows in df1. In this case there are only a few hundred so I can deal with it, but it won't scale as well as something else. Solution:
df2_list = []
df1['merge_row'] = df1.index.values # Make a column to merge on with the index values
for i, row in df1.iterrows():
    # .copy() so that adding a column does not write into a slice of df2
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0].copy()
    df2_subset['merge_row'] = i # Add a merge column
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
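If the loop over df1 becomes the bottleneck, one possible way to avoid it is to compute all pairwise squared distances at once with NumPy broadcasting and then merge on the matching row pairs. The sketch below is illustrative only: `proximity_merge` and `radius_sq` are invented names, and the distance matrix needs memory proportional to len(df1) * len(df2), so it only suits inputs where that product is manageable.

```python
import numpy as np
import pandas as pd

def proximity_merge(df1, df2, radius_sq=1.0):
    """Left-join df2 rows onto every df1 row within the squared radius."""
    # Pairwise coordinate differences, shape (len(df1), len(df2)).
    dx = df1["x"].to_numpy()[:, None] - df2["x"].to_numpy()[None, :]
    dy = df1["y"].to_numpy()[:, None] - df2["y"].to_numpy()[None, :]
    i1, i2 = np.nonzero(dx**2 + dy**2 < radius_sq)   # matching (df1, df2) pairs

    matches = df2.iloc[i2].reset_index(drop=True)
    matches["merge_row"] = df1.index.to_numpy()[i1]
    left = df1.assign(merge_row=df1.index)
    return left.merge(matches, on="merge_row", how="left").drop(columns="merge_row")

df1 = pd.DataFrame({"x": [0, 1, 2], "y": [1, 3, 9], "val": list("ABC")})
df2 = pd.DataFrame({"x": [1.2, 0.9, 2.0], "y": [2.8, 3.1, 9.5], "val": list("abc")})
result = proximity_merge(df1, df2)
print(result)
```

Because both frames share the column names x, y and val, the merge adds the `_x` and `_y` suffixes, just as in the loop-based solution.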