I am trying to generate a new column in a pandas DataFrame by looping over more than 100,000 rows and setting each row's value based on an existing column. The DataFrame below is a dummy, but it works as an example. My current code is:
df = pd.DataFrame({'IT100': [5, 5, -0.001371, 0.0002095, -5, 0, -5, 5, 5],
                   'ET110': [0.008187884, 0.008285232, 0.00838258, 0.008479928, 1, 1, 1, 1, 1]})
# If charging, set CD to 1; if discharging, set CD to -1.
# Charging is defined as IT100 > 1 and discharging as IT100 < -1.
# If -1 < IT100 < 1, set CD to the previous cell's value.
def CD(dataFrame):
    for x in range(0, len(dataFrame.index)):
        current = dataFrame.loc[x, "IT100"]
        if x == 0:
            if dataFrame.loc[x + 5, "IT100"] > -1:
                dataFrame.loc[x, "CD"] = 1
            else:
                dataFrame.loc[x, "CD"] = -1
        else:
            if current > 1:
                dataFrame.loc[x, "CD"] = 1
            elif current < -1:
                dataFrame.loc[x, "CD"] = -1
            else:
                dataFrame.loc[x, "CD"] = dataFrame.loc[x - 1, "CD"]
Using if/else in a loop like this is extremely slow. I have seen suggestions to use np.select() or df.apply(), but I do not know whether they will work for my example: one of my conditions sets the value of the new column to the value of the previous cell in that same column, so I need to be able to index into it.
Thanks for any help!
@Grajdeanu Alex is right: the loop is slowing you down more than whatever you're doing inside of it. With pandas, a loop is usually the slowest choice. Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = np.nan
#lower saturation
df.loc[df['IT100'] < -1,['CD']] = -1
#upper saturation
df.loc[df['IT100'] > 1,['CD']] = 1
#fill forward
df['CD'] = df['CD'].ffill()
# setting the first row equal to the fifth
df.loc[0,['CD']] = df.loc[5,['CD']]
Using ffill() fills each remaining NaN (the rows where -1 < x < 1) with the last valid value above it.
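Note that the last line copies row 5's CD value into row 0. If you instead want the question's first-row rule exactly (threshold IT100 five rows ahead), a one-line variant (my adjustment, not part of the original answer):
# question's rule for the first row: look at IT100 five rows ahead
df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1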
Similar to EMiller's answer, you could also use clip.
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = df['IT100'].clip(-1, 1)
df.loc[~df['CD'].isin([-1, 1]), 'CD'] = np.nan
df['CD'] = df['CD'].ffill()
df.loc[0,['CD']] = df.loc[5,['CD']]
As an alternative to @EMiller's answer:
In [213]: df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
In [214]: df
Out[214]:
IT100
0 0.00
1 -50.00
2 -20.00
3 -0.50
4 -0.25
5 -0.50
6 -10.00
7 5.00
8 0.50
In [215]: df['CD'] = pd.Series(np.where(df['IT100'].between(-1, 1), np.nan, df['IT100'].clip(-1, 1))).ffill()
In [217]: df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1
In [218]: df
Out[218]:
IT100 CD
0 0.00 1.0
1 -50.00 -1.0
2 -20.00 -1.0
3 -0.50 -1.0
4 -0.25 -1.0
5 -0.50 -1.0
6 -10.00 -1.0
7 5.00 1.0
8 0.50 1.0
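Since the question mentions np.select(), here is a minimal sketch of the same saturate-then-ffill pattern using it (my variant, not from the answers above):
import numpy as np
import pandas as pd
df = pd.DataFrame({'IT100': [0, -50, -20, -0.5, -0.25, -0.5, -10, 5, 0.5]})
# np.select evaluates the conditions in order; rows matching neither
# condition get the default (NaN), which ffill() then carries forward
conditions = [df['IT100'] > 1, df['IT100'] < -1]
choices = [1, -1]
df['CD'] = pd.Series(np.select(conditions, choices, default=np.nan)).ffill()
df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1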
How do I drop all the rows that have a RET with absolute value greater than 0.10?
I tried using this, but I am not sure what to do next:
df_drop = [data[data[abs(float('RET')) < 0.10]
You can keep the rows where RET has an absolute value of at least 0.10 with:
data = data[abs(data['RET'].astype('float')) >= 0.10]
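If RET arrives as strings that may contain unparseable entries, a more forgiving variant (an assumption about the data, not part of the original answer) is pd.to_numeric:
import pandas as pd
# errors='coerce' turns unparseable strings into NaN instead of raising
data['RET'] = pd.to_numeric(data['RET'], errors='coerce')
data = data[data['RET'].abs() >= 0.10]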
I advise you to code it in two steps to understand it better.
import pandas as pd
df = pd.DataFrame({'val': [5, 11, 89, 63],
                   'RET': [-0.1, 0.5, -0.04, 0.09],
                   })
# Step 1: define the mask
m = abs(df['RET']) > 0.1
# Step 2: apply the mask
print(df[m])
Result
df[m]
val RET
1 11 0.5
You can use abs() and le() or lt() to filter for the wanted values.
df_drop = data[data['RET'].abs().lt(0.10)]
Also consider between(). Select an appropriate policy for inclusive; it can be "neither", "both", "left", or "right".
df_drop = data[data['RET'].between(-0.10, 0.10, inclusive="neither")]
Example
data = pd.DataFrame({
    'RET': [0.011806, -0.122290, 0.274011, 0.039013, -0.05044],
    'other': [1, 2, 3, 4, 5]
})
RET other
0 0.011806 1
1 -0.122290 2
2 0.274011 3
3 0.039013 4
4 -0.050440 5
Both approaches above lead to
RET other
0 0.011806 1
3 0.039013 4
4 -0.050440 5
All rows with an absolute value greater than 0.10 in RET are excluded.
I would like to know how to achieve the following: given a DataFrame, I want to create an array of all values between -1 and 1. Just the values; I don't care about the day or the index.
Here is the code:
import pandas as pd
import numpy as np
import random
data = [[round(random.uniform(1,100),2) for i in range(7)] for i in range(10)]
header = ['Lunes', 'Martes', 'Miércoles', 'Jueves', 'Viernes', 'Sábado', 'Domingo']
df = pd.DataFrame(data, columns = header)
mean = df.mean()
std = df.std()
df_normalizado = (df-mean)/std
Lunes Martes Miércoles Jueves Viernes Sábado Domingo
0 -0.250799 1.001706 -0.491738 0.444629 -0.296997 -0.670781 -1.554641
1 -0.868792 -0.100689 -0.359056 1.282681 1.352212 1.176829 -1.374482
2 -0.614918 1.187862 1.398010 1.037513 -1.149555 -0.834707 0.143520
3 -0.319758 1.113691 -0.719597 -1.392089 -0.591716 0.943564 -1.163994
4 -0.718137 -1.300041 1.267097 -0.797168 0.053323 1.187264 0.078008
5 -0.883286 -0.821076 -0.671478 1.268079 0.002583 -0.897651 1.096177
6 1.933040 -0.534570 -1.142057 -0.262689 1.417233 0.851335 0.780141
7 -0.433957 -0.575776 1.406855 0.248020 -1.113399 -0.178332 0.497165
8 1.357213 -1.070254 -0.882708 -1.133679 -0.863344 -1.613941 0.491402
9 0.799394 1.099147 0.194671 -0.695298 1.189661 0.036420 1.006704
Thank you, community!
Since just an array is needed, grab the values from the DataFrame and use normal boolean indexing:
a = df_normalizado.values
print(a[(-1 <= a) & (a <= 1)])
Output:
[-0.250799 -0.491738 0.444629 -0.296997 -0.670781 -0.868792 -0.100689
-0.359056 -0.614918 -0.834707 0.14352 -0.319758 -0.719597 -0.591716
0.943564 -0.718137 -0.797168 0.053323 0.078008 -0.883286 -0.821076
-0.671478 0.002583 -0.897651 -0.53457 -0.262689 0.851335 0.780141
-0.433957 -0.575776 0.24802 -0.178332 0.497165 -0.882708 -0.863344
0.491402 0.799394 0.194671 -0.695298 0.03642 ]
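An equivalent sketch that stays within pandas: stack() flattens the frame to a Series, then between() keeps the in-range values (to_numpy() assumes pandas 0.24+):
s = df_normalizado.stack()
print(s[s.between(-1, 1)].to_numpy())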
Pandas also offers the query() function; I hope this helps solve your issue:
df.query("Lunes >= -1 and Lunes <= 1 and
Martes >= -1 and Martes <= 1 and
Miércoles >= -1 and Miércoles <= 1 and
Jueves >= -1 and Jueves <= 1 and
Viernes >= -1 and Viernes <= 1 and
Sábado >= -1 and Sábado <= 1 and
Domingo >= -1 and Domingo <=1")
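Writing out each column by hand gets tedious; a sketch that builds the same expression programmatically (assuming all columns are numeric):
expr = " and ".join(f"{col} >= -1 and {col} <= 1" for col in df_normalizado.columns)
df_normalizado.query(expr)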
I have a pandas dataframe with example data:
idx price lookback
0 5
1 7 1
2 4 2
3 3 1
4 7 3
5 6 1
Lookback can be positive or negative but I want to take the absolute value of it for how many rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious approach; this did not work:
df['sbc'] = df['price'].shift(df['lookback'].abs() + 1)
I then tried using a lambda; this did not work either, but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x + 1)]))
df['sbc'] = sbc(df['price'], df['lookback'].abs())
I also tried a loop (which was extremely slow, but worked), though I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) < i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several of the provided solutions worked great (thank you!), but all needed small tweaks to handle potentially negative numbers and the fact that it is lookback + 1, not lookback - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop, which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, since improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
        else:
            return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
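For completeness, folding the streak-roc calculation from the question into this vectorized version, a sketch (assuming the default RangeIndex):
import numpy as np
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
# rate of change vs. the looked-up price, as in the question
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']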
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function:
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
This solution loops over the values of the lookback column and calculates the index of the wanted value in the price column, storing the results in an array. The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index must not be less than 0;
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan
df['lb_price'] = new
I presume similar questions exist, but could not locate them. I have Pandas 0.19.2 installed. I have a large dataframe, and for each row value I want to carry over the previous row's value for the same column based on some logical condition.
Below is a brute-force double for loop solution for a small example. What is the most efficient way to implement this? Is it possible to solve this in a vectorised manner?
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.uniform(low=-0.2, high=0.2, size=(10,2) ))
print(df)
for col in df.columns:
    prev = None
    for i, r in df.iterrows():
        if prev is not None:
            if (df[col].loc[i] <= prev * 1.5) and (df[col].loc[i] >= prev * 0.5):
                df[col].loc[i] = prev
        prev = df[col].loc[i]
print(df)
Output:
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.132356 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
EDIT: Please note one value can be carried over multiple times, so long as it satisfies the logical condition.
prev = df.shift()
replace_mask = (0.5 * prev <= df) & (df <= 1.5 * prev)
df = df.where(~replace_mask, prev)
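A runnable sketch of this on the question's data (note: because shift() compares each row to the original previous value, a chain of values carried over more than once may need repeated passes, which is what the while loop in the next answer does):
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.uniform(low=-0.2, high=0.2, size=(10, 2)))
prev = df.shift()
replace_mask = (0.5 * prev <= df) & (df <= 1.5 * prev)
df = df.where(~replace_mask, prev)
print(df)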
I came up with this:
keep_going = True
while keep_going:
    df = df.mask((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).ffill()
    trimming_to_do = ((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).values.any()
    if not trimming_to_do:
        keep_going = False
which gives the desired result (at least for this case):
print(df)
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.120775 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
I've got a many-row, many-column DataFrame with different 'placeholder' values needing substitution (in a subset of columns). I've read many examples in the forum using nested lists or dictionaries, but haven't had luck with variations.
# A test dataframe
df = pd.DataFrame({'Sample': ['alpha', 'beta', 'gamma', 'delta', 'epsilon'],
                   'element1': [1, -0.01, -5000, 1, -2000],
                   'element2': [1, 1, 1, -5000, 2],
                   'element3': [-5000, 1, 1, -0.02, 2]})
# List of headings containing values to replace
headings = ['element1', 'element2', 'element3']
And I am trying to do something like this (obviously this doesn't work):
# If any rows have value <-1, NaN
df[headings].replace(df[headings < -1], np.nan)
# If a value is between -1 and 0, make a replacement
df[headings].replace(df[headings < 0 & headings > -1], 0.05)
So, is there possibly a better way to accomplish this using loops or fancy pandas tricks?
You can set the Sample column as index and then replace values on the whole data frame based on conditions:
df = df.set_index('Sample')
df[df < -1] = np.nan
df[(df < 0) & (df > -1)] = 0.05
Which gives:
# element1 element2 element3
# Sample
# alpha 1.00 1.0 NaN
# beta 0.05 1.0 1.00
# gamma NaN 1.0 1.00
# delta 1.00 NaN 0.05
# epsilon NaN 2.0 2.00
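If you need Sample back as a regular column afterwards, an optional follow-up (not part of the original answer):
df = df.reset_index()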
Here is the successful answer, as suggested by @Psidom.
The solution involves taking a slice of the dataframe, applying the changes, then reincorporating the amended slice:
df1 = df.loc[:, headings]
df1[df1 < -1] = np.nan
df1[(df1 < 0)] = 0.05
df.loc[:, headings] = df1
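An equivalent sketch using DataFrame.mask on just the selected columns; both conditions are computed on the original slice, so the two replacements don't interact:
import numpy as np
sub = df[headings]
# mask(cond, value) replaces entries where cond is True
df[headings] = sub.mask(sub < -1, np.nan).mask((sub > -1) & (sub < 0), 0.05)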