Applying values to a DataFrame without using a for-loop - python

I'm looking for a faster method of applying values to a column in a DataFrame. The value is based on two True and False values in the first and second column. This is my current solution:
df['result'] = df.check1.astype(int)
for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)
Which yields this result:
check1 check2 result
0 True False 1
1 False False 1
2 False False 1
3 False False 1
4 False False 1
5 False False 1
6 False True 2
7 False False 2
8 False True 3
9 False False 3
10 False True 4
11 False False 4
12 False True 5
13 False False 5
14 False True 6
15 False False 6
16 False True 7
17 False False 7
18 False False 7
19 False False 7
20 False True 8
21 False False 8
22 False True 9
23 True False 1
24 False False 1
So the third column needs to be a number based on the value in the row above it.
If check1 is True the number needs to go back to 1. If check2 is True, 1 needs to be added to the number. Otherwise the number stays the same.
The current code works, but it's taking too long as I need to apply this to a DataFrame with approx. 70,000 rows. I'm pretty sure it can be improved (I'm guessing with the apply function, but I'm not sure). Any ideas?
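For reference, the example frame can be rebuilt from the table above (a reconstruction for anyone who wants to run the answers below, not code from the question):
import pandas as pd

df = pd.DataFrame({
    'check1': [i in (0, 23) for i in range(25)],
    'check2': [i in (6, 8, 10, 12, 14, 16, 20, 22) for i in range(25)],
})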

Use pandas.DataFrame.groupby.cumsum:
import pandas as pd
df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)
Or @Dan's suggestion:
df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)
Output:
check1 check2 result
0 True False 1.0
1 False False 1.0
2 False False 1.0
3 False False 1.0
4 False False 1.0
5 False False 1.0
6 False True 2.0
7 False False 2.0
8 False True 3.0
9 False False 3.0
10 False True 4.0
11 False False 4.0
12 False True 5.0
13 False False 5.0
14 False True 6.0
15 False False 6.0
16 False True 7.0
17 False False 7.0
18 False False 7.0
19 False False 7.0
20 False True 8.0
21 False False 8.0
22 False True 9.0
23 True False 1.0
24 False False 1.0
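To see why the grouping works, it helps to look at the key itself (a small illustrative check, not part of the answer): df['check1'].cumsum() increments at every True in check1, so each block that starts with check1 == True gets its own label, and within a block the cumulative sum of check2 plus 1 rebuilds the counter.
key = df['check1'].cumsum()
print(key.tolist())   # 1 for rows 0-22, then 2 for rows 23-24: a new group starts at every True in check1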

You want to iterate over a dataframe using the value of the preceding row. In that case, the most efficient way is to iterate directly over the underlying numpy arrays:
import io
import pandas as pd

df = pd.read_fwf(io.StringIO(t))   # t is the question's table pasted as text
df['result'] = df.check1.astype(int)
res = df['result'].values
c1 = df['check1'].values
c2 = df['check2'].values
old = -1
for i in range(len(df)):
    if res[i] != 1:
        res[i] = old + int(c2[i])
    old = res[i]
This works fine because numpy arrays are mutable types, so the changes are reflected in the dataframe.
Timeit says that this is twice as fast as @Chris's original solution, and still 1.5 times faster after @Dan's improvement.
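If you want to reproduce the timing comparison, here is a rough sketch with the standard timeit module (the wrapper function name is made up for this example):
import timeit

def loop_over_arrays(df):
    res = df['check1'].astype(int).to_numpy()
    c2 = df['check2'].to_numpy()
    old = -1
    for i in range(len(res)):
        if res[i] != 1:
            res[i] = old + int(c2[i])
        old = res[i]
    return res

print(timeit.timeit(lambda: loop_over_arrays(df), number=1000))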

Related

Python pandas: How to match data between two dataframes

The first dataframe (df1) is similar to this:
Result       A      B      C
2021-12-31   False  True   True
2022-01-01   False  False  True
2022-01-02   False  True   False
2022-01-03   True   False  True
df2 is an updated version of df1: the dates are new and columns may have been added. It is similar to this:
Result       A      B      C      D
2022-01-04   False  False  True   True
2022-01-05   True   False  True   True
2022-01-06   False  True   False  True
2022-01-07   False  False  True   True
I want to combine the two dataframes, but I don't know how to do it.
I want to get a result similar to the following:
Result       A      B      C      D
2021-12-31   False  True   True   NaN
2022-01-01   False  False  True   NaN
2022-01-02   False  True   False  NaN
2022-01-03   True   False  True   NaN
2022-01-04   False  False  True   True
2022-01-05   True   False  True   True
2022-01-06   False  True   False  True
2022-01-07   False  False  True   True
Thank you very much!
Use the concatenate function while ignoring the indexes:
df_new = pd.concat([df1, df2], ignore_index=True)
Any missing values (for example column D in the rows coming from df1) will be NaN.
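A small end-to-end sketch with frames shaped like df1 and df2 (values abridged, column names taken from the question):
import pandas as pd

df1 = pd.DataFrame({'Result': ['2021-12-31', '2022-01-01'],
                    'A': [False, False],
                    'B': [True, False],
                    'C': [True, True]})
df2 = pd.DataFrame({'Result': ['2022-01-04', '2022-01-05'],
                    'A': [False, True],
                    'B': [False, False],
                    'C': [True, True],
                    'D': [True, True]})

df_new = pd.concat([df1, df2], ignore_index=True)
print(df_new)   # column D shows NaN for the rows that came from df1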

How to retrieve pandas dataframe rows surrounding rows with a True boolean?

Suppose I have a df of the following format:
Assuming row 198 is True for rot_mismatch, what would be the best way to retrieve the True row (easy) and the rows above and below it (unsolved)?
I have multiple lines with a True boolean and would like to automatically create a dataframe for closer investigation, always including the True line and its surrounding lines.
Thanks!
Edit for clarification:
exemplary input:
id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
4   Rob      False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False
9   Lisa     False
desired output:
id  name     Bool
1   Sta      False
2   Danny    True
3   Elle     False
5   Dan      False
6   Holger   True
7   Mat      True
8   Derrick  False
Assuming this input:
col1 rot_mismatch
0 A False
1 B True
2 C False
3 D False
4 E False
5 F False
6 G True
7 H True
To get the N rows before/after any True, you can use a rolling operation to compute a mask for boolean indexing:
N = 1
mask = (df['rot_mismatch']
        .rolling(2*N+1, center=True, min_periods=1)
        .max().astype(bool)
        )
df2 = df.loc[mask]
output:
# N = 1
col1 rot_mismatch
0 A False
1 B True
2 C False
5 F False
6 G True
7 H True
# N = 0
col1 rot_mismatch
1 B True
6 G True
7 H True
# N = 2
col1 rot_mismatch
0 A False
1 B True
2 C False
3 D False
4 E False
5 F False
6 G True
7 H True
Try with shift:
>>> df[df["rot_mismatch"]|df["rot_mismatch"].shift()|df["rot_mismatch"].shift(-1)]
dep_ap_sched arr_ap_sched rot_mismatch
120 East Carmen South Nathaniel False
198 South Nathaniel East Carmen True
289 East Carmen Joneshaven False
Output for amended example:
>>> df[df["Bool"]|df["Bool"].shift()|df["Bool"].shift(-1)]
id name Bool
0 1 Sta False
1 2 Danny True
2 3 Elle False
4 5 Dan False
5 6 Holger True
6 7 Mat True
7 8 Derrick False
Is this what you want?
df_true = df.loc[df['rot_mismatch'], :]
df_false = df.loc[~df['rot_mismatch'], :]

Pandas and Numpy consecutive non Nan values

I'm trying to use np.where to count consecutive non-NaN values longer than a certain length as shown below:
e.g. if there are 3 or more consecutive non-NaN values, then return True.
Would appreciate any help!
value  consecutive
nan    False
nan    False
1      False
1      False
nan    False
4      True
2      True
3      True
nan    False
nan    False
1      True
3      True
3      True
5      True
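For reference, the frame above can be rebuilt like this (a reconstruction from the table, including the expected consecutive column):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [np.nan, np.nan, 1, 1, np.nan, 4, 2, 3, np.nan, np.nan, 1, 3, 3, 5],
    'consecutive': [False, False, False, False, False, True, True, True,
                    False, False, True, True, True, True],
})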
The idea is to create groups by testing for missing values, then map each group to its size using Series.map with Series.value_counts, computed only on the non-NaN rows selected by the inverted mask ~m:
#convert values to numeric
df['value'] = df['value'].astype(float)
m = df['value'].isna()
s = m.cumsum()
N = 3
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print (df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
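To see what the intermediate objects look like, here is a short illustrative breakdown of the same steps:
m = df['value'].isna()            # True where value is NaN
s = m.cumsum()                    # run id: constant within each non-NaN run
run_sizes = s[~m].value_counts()  # length of every non-NaN run, keyed by run id
print(run_sizes)                  # here: runs of length 2, 3 and 4; only the runs of 3+ become True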

Pandas: How to compute a conditional rolling/accumulative maximum within a group

I would like to achieve the following result in the condrolmax column (a conditional rolling/accumulative maximum based on the close column) without using a painfully slow for-loop.
Index close bool condrolmax
0 1 True 1
1 3 True 3
2 2 True 3
3 5 True 5
4 3 False 5
5 3 True 3 --> rolling/accumulative maximum reset (False cond above)
6 4 True 4
7 5 False 4
8 7 False 4
9 5 True 5 --> rolling/accumulative maximum reset (False cond above)
10 7 False 5
11 8 False 5
12 6 True 6 --> rolling/accumulative maximum reset (False cond above)
13 8 True 8
14 5 False 8
15 5 True 5 --> rolling/accumulative maximum reset (False cond above)
16 7 True 7
17 15 True 15
18 16 True 16
The code to create this dataframe:
import pandas as pd

# initialise data of lists.
data = {'close': [1, 3, 2, 5, 3, 3, 4, 5, 7, 5, 7, 8, 6, 8, 5, 5, 7, 15, 16],
        'bool': [True, True, True, True, False, True, True, False, False, True, False,
                 False, True, True, False, True, True, True, True],
        'condrolmax': [1, 3, 3, 5, 5, 3, 4, 4, 4, 5, 5, 5, 6, 8, 8, 5, 7, 15, 16]}
# Create DataFrame
df = pd.DataFrame(data)
I am sure it is possible to vectorize this (one-liner). Any suggestions?
Thanks again!
You can create a grouping key and then use cummax(), as follows:
# Set group: New group if current row `bool` is True and last row `bool` is False
g = (df['bool'] & (~df['bool']).shift()).cumsum()
# Get cumulative max of column `close` within the group
df['condrolmax'] = df.groupby(g)['close'].cummax()
Result:
print(df)
close bool condrolmax
0 1 True 1
1 3 True 3
2 2 True 3
3 5 True 5
4 3 False 5
5 3 True 3
6 4 True 4
7 5 False 5
8 7 False 7
9 5 True 5
10 7 False 7
11 8 False 8
12 6 True 6
13 8 True 8
14 5 False 8
15 5 True 5
16 7 True 7
17 15 True 15
18 16 True 16
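To see how the grouping key behaves, here is a small illustrative check (not part of the original answer):
g = (df['bool'] & (~df['bool']).shift()).cumsum()
# the label increases at rows 5, 9, 12 and 15, i.e. exactly where a True follows a False,
# so the cumulative maximum restarts at each of those rows
print(df.assign(group=g))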
First make groups using your condition (bool changing from False to True) and cumsum, then apply your rolling after a groupby:
group = (df['bool']&(~df['bool']).shift()).cumsum()
df.groupby(group)['close'].rolling(2, min_periods=1).max()
output:
0 0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
1 5 3.0
6 4.0
7 5.0
8 7.0
2 9 5.0
10 7.0
11 8.0
3 12 6.0
13 8.0
14 8.0
4 15 5.0
16 7.0
17 15.0
18 16.0
Name: close, dtype: float64
To insert back as a column:
df['condrolmax'] = df.groupby(group)['close'].rolling(2, min_periods=1).max().droplevel(0)
output:
close bool condrolmax
0 1 True 1.0
1 3 True 3.0
2 2 True 3.0
3 5 True 5.0
4 3 False 5.0
5 3 True 3.0
6 4 True 4.0
7 5 False 5.0
8 7 False 7.0
9 5 True 5.0
10 7 False 7.0
11 8 False 8.0
12 6 True 6.0
13 8 True 8.0
14 5 False 8.0
15 5 True 5.0
16 7 True 7.0
17 15 True 15.0
18 16 True 16.0
NB. if you want the boundary to be included in the rolling, use min_periods=1 in rolling
I'm not sure how to use linear algebra and vectorization to make this function faster, but using list comprehensions we can write a faster algorithm. First, define the function as:
import numpy as np

def faster_condrolmax(df):
    df['cond_index'] = [df.index[i] if df['bool'][i] == False else 0 for i in df.index]
    df['cond_comp_index'] = [np.max(df.cond_index[0:i]) for i in df.index]
    df['cond_comp_index'] = df['cond_comp_index'].fillna(0).astype(int)
    df['condrolmax'] = np.zeros(len(df.close))
    df['condrolmax'] = [np.max(df.close[df.cond_comp_index[i]:i]) if df.cond_comp_index[i] < i
                        else df.close[i] for i in range(len(df.close))]
    return df
Then, you can use:
!pip install line_profiler
%load_ext line_profiler
to install and load the line profiler and see how long each line of the code takes with:
%lprun -f faster_condrolmax faster_condrolmax(df)
which will show the per-line profiling results. Or, to just see how long the whole function takes:
%timeit faster_condrolmax(df)
which will report the overall timing.
SeaBean's function gives better results, taking about half the time of the function proposed here. However, that estimate doesn't seem robust; to evaluate it properly, you should run both on a larger dataset, compare what %timeit reports, and then decide.

pandas groupby column and check if group meets multiple conditions

I have a DataFrame that looks like the following:
X Y Date are_equal
0 50.0 10.0 2018-08-19 False
1 NaN 10.0 2018-08-19 False
2 NaN 50.0 2018-08-19 True
3 10.0 NaN 2018-08-21 False
4 1.0 NaN 2018-08-19 False
5 NaN 10.0 2018-08-22 False
6 10.0 NaN 2018-08-21 False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for a date that contains are_equal True.
My approach was to use df.iterrows() and get the row at next index after meeting the condition df['are_equal'] == True. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False
Let us use apply; this makes it easy to add more conditions, but it is slow. You can check the other solution using transform:
df['New']=df.groupby('Date').apply(lambda x : (x['X']==1)&x['are_equal'].any()).reset_index(level=0,drop=True)
df
Out[101]:
X Y Date are_equal New
0 50.0 10.0 2018-08-19 False False
1 NaN 10.0 2018-08-19 False False
2 NaN 50.0 2018-08-19 True False
3 10.0 NaN 2018-08-21 False False
4 1.0 NaN 2018-08-19 False True
5 NaN 10.0 2018-08-22 False False
6 10.0 NaN 2018-08-21 False False
Or transform
df['X'].eq(1)&(df.groupby('Date')['are_equal'].transform('any'))
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
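For completeness, here is a reproducible sketch of the transform approach on a frame rebuilt from the question's table:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [50.0, np.nan, np.nan, 10.0, 1.0, np.nan, 10.0],
    'Y': [10.0, 10.0, 50.0, np.nan, np.nan, 10.0, np.nan],
    'Date': ['2018-08-19', '2018-08-19', '2018-08-19', '2018-08-21',
             '2018-08-19', '2018-08-22', '2018-08-21'],
    'are_equal': [False, False, True, False, False, False, False],
})
df['contains_specific_value'] = df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
print(df['contains_specific_value'])   # True only at row 4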
