Could you please help me solve a specific task? I need to process a pandas DataFrame column row by row. The main point is that "None" values must be turned into "0" or "1", continuing the "0" or "1" values that already precede them in the column. I've done it with a "for" loop, and it works correctly:
for i in np.arange(1, len(df['signal'])):
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 0:
        df['signal'].iloc[i] = 0
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 1:
        df['signal'].iloc[i] = 1
But iterating over a DataFrame row by row is known to be a bad approach.
I tried the "loc" method, but it gives incorrect results: each step does not see the results of the previous steps, so some "None" values remain unchanged.
df.loc[df['signal'].isnull() & (df['signal'].shift(1) == 0), 'signal'] = 0
df.loc[df['signal'].isnull() & (df['signal'].shift(1) == 1), 'signal'] = 1
Does anyone have any idea how to implement this task without a "for" loop?
There are vectorized functions for just this purpose that will be much faster:
df = pd.DataFrame(dict(a=[1, 1, np.nan, np.nan], b=[0, 1, 0, np.nan]))

# df
     a    b
0  1.0  0.0
1  1.0  1.0
2  NaN  0.0
3  NaN  NaN

df.ffill()
# output
     a    b
0  1.0  0.0
1  1.0  1.0
2  1.0  0.0
3  1.0  0.0
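Applied to the question's column, this becomes a one-liner (assuming the missing entries are real NaN/None values rather than the string "None"):

# forward-fill propagates the last seen 0 or 1 into the NaNs that follow
df['signal'] = df['signal'].ffill()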
You can use numpy where:
import numpy as np
df['signal'] = np.where(pd.isnull(df['signal']), df['signal'].shift(1), df['signal'])
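Note that, like the .loc attempt in the question, a single shift only fills one NaN after each valid value, so a run of consecutive NaNs needs repeated passes. A minimal sketch of such a loop (an illustration, not part of the original answer):

# repeat the single-step fill; stop when the number of NaNs stops shrinking
n_missing = df['signal'].isnull().sum()
while n_missing > 0:
    df['signal'] = np.where(df['signal'].isnull(), df['signal'].shift(1), df['signal'])
    still_missing = df['signal'].isnull().sum()
    if still_missing == n_missing:
        break  # only unfillable leading NaNs remain
    n_missing = still_missing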
I have a simulation that uses pandas DataFrames to describe objects in a hierarchy. To achieve this, I use a MultiIndex to show the route to a child object.
Parent df
          par_val
a b              
0 0.0    0.366660
  1.0    0.613888
1 2.0    0.506531
  3.0    0.327356
2 4.0    0.684335
  0.0    0.013800
3 1.0    0.590058
  2.0    0.179399
4 3.0    0.790628
  4.0    0.310662
Child df
          child_val
a b   c            
0 0.0 0    0.528217
  1.0 0    0.515479
1 2.0 0    0.719221
  3.0 0    0.785008
2 4.0 0    0.249344
  0.0 0    0.455133
3 1.0 0    0.009394
  2.0 0    0.775960
4 3.0 0    0.639091
  4.0 0    0.150854
0 0.0 1    0.319277
  1.0 1    0.571580
1 2.0 1    0.029063
  3.0 1    0.498197
2 4.0 1    0.424188
  0.0 1    0.572045
3 1.0 1    0.246166
  2.0 1    0.888984
4 3.0 1    0.818633
  4.0 1    0.366697
This implies that objects (0,0,0) and (0,0,1) in the child DataFrame are both characterised by the values at (0,0) in the parent DataFrame.
When a function is applied to the child DataFrame for a certain subset of 'a', it may therefore need to grab a value from the parent. My current solution locates the value in the parent DataFrame by index, inside the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()
    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),
         'b': np.append(np.arange(i / 2), np.arange(i / 2)),
         'par_val': np.random.rand(i)
        }).set_index(['a', 'b'])
    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c'])\
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)
    stop = time.time()
    dt.append(stop - start)

plt.plot(r, dt)
plt.show()
The solution function becomes very costly for large numbers of iterations in the simulation:
(plot: iterations on the x-axis vs. time in seconds on the y-axis)
Is there a more efficient method of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid that: the very large number of repetitions reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8 MB of memory (before the OS's memory compression kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python loop instead of much faster vectorized code. In SQL we call the loop approach "row-by-agonizing-row", and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
a_mask = df_child.index.get_level_values('a') == 0
# drop the 'c' level so the child values line up with the parent's (a, b) index
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)
# one vectorized lookup-and-divide instead of a per-row apply
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
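A note on the last line: solution is indexed by (a, b) while the target rows of df_child are indexed by (a, b, c), so assigning the Series directly would misalign; .to_numpy() drops the index and assigns the values purely by position.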
Before: (timing plot of the original apply-based code)
After: (timing plot of the vectorized version)
This one is hard to describe, but: for every pair of columns in a dataframe, create a new column containing the row-wise mean of the two, working through every pairing down the line. Running Python 3.6.
For example, given this dataframe:

   a  b  c
0  1  1  0
1  0  1  0
2  1  1  0
I would like to get this output:
The exact order of the added columns at the end isn't important, but it needs to handle every possible combination of means between all columns, with a depth of 2 (i.e. comparing one column to another). Ideally, I would like the depth to be a separate variable, so that with a depth of 3 it would do the same but compare 3 columns to one another.
Ideas? Thanks!
UPDATE
I got this to work, but I'm wondering if there's a computationally faster way of doing it. I basically created two nested loops to compare each column against the rest, skipping same-column comparisons:
eng_features = pd.DataFrame()
for col in df.columns:
    for col2 in df.columns:
        # Don't compare same columns, or inversed same columns
        if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
            continue
        eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)

df = pd.concat([df, eng_features], axis=1)
Use itertools, a built-in Python utility package for iterators:
from itertools import permutations

for col1, col2 in permutations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1, col2]].mean(axis=1)
and you will get what you need:
   a  b  c  Mean_of_a-b  Mean_of_a-c  Mean_of_b-a  Mean_of_b-c  Mean_of_c-a  \
0  1  1  0          1.0          0.5          1.0          0.5          0.5
1  0  1  0          0.5          0.0          0.5          0.5          0.0
2  1  1  0          1.0          0.5          1.0          0.5          0.5

   Mean_of_c-b
0          0.5
1          0.5
2          0.5
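Note that permutations yields both orders of each pair, so Mean_of_a-b and Mean_of_b-a are duplicates. If you want each unordered pair only once, and the depth as a variable as asked in the question, itertools.combinations is the natural fit; a sketch (the depth variable and the column naming here are illustrative, not from the original answer):

from itertools import combinations

depth = 3  # number of columns averaged together
for cols in combinations(df.columns, r=depth):
    df['Mean_of_' + '-'.join(map(str, cols))] = df[list(cols)].mean(axis=1)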
I have a situation where I want to use the results of a groupby on my training set to fill in results for my test set.
I don't think there's a straightforward way to do this in pandas, so I'm trying to use the apply method on the column in my test set.
MY SITUATION:
I want to use the average values from my MSZoning column to infer the missing value for my LotFrontage column.
If I use the groupby method on my training set I get this:
train.groupby('MSZoning')['LotFrontage'].agg(['mean', 'count'])
giving.....
Now, I want to use these values to impute missing values on my test set, so I can't just use the transform method.
Instead, I created a function that I wanted to pass into the apply method, which can be seen here:
def fill_MSZoning(row):
    if row['MSZoning'] == 'C':
        return 69.7
    elif row['MSZoning'] == 'FV':
        return 59.49
    elif row['MSZoning'] == 'RH':
        return 58.92
    elif row['MSZoning'] == 'RL':
        return 74.68
    else:
        return 52.4
I call the function like this:
test['LotFrontage'] = test.apply(lambda x: x.fillna(fill_MSZoning), axis=1)
Now the results in the LotFrontage column are identical to the Id column, even though I didn't specify that.
Any idea what is happening?
You can do it like this:
import pandas as pd
import numpy as np

## creating dummy data
np.random.seed(100)
raw = {
    "group": np.random.choice("A B C".split(), 10),
    "value": [np.nan if np.random.rand() > 0.8 else np.random.choice(100) for _ in range(10)]
}
df = pd.DataFrame(raw)
display(df)

## calculate mean
means = df.groupby("group").mean()
display(means)
Fill With Group Mean
## fill with mean value
def fill_group_mean(x):
    # x is one group; x["group"].max() just extracts that group's label
    group_mean = means["value"].loc[x["group"].max()]
    return x["value"].mask(x["value"].isna(), group_mean)

r = df.groupby("group").apply(fill_group_mean)
r.reset_index(level=0)
Output
  group  value
0     A    NaN
1     A   24.0
2     A   60.0
3     C    9.0
4     C    2.0
5     A    NaN
6     C    NaN
7     B   83.0
8     C   91.0
9     C    7.0

  group  value
0     A  42.00
1     A  24.00
2     A  60.00
5     A  42.00
7     B  83.00
3     C   9.00
4     C   2.00
6     C  27.25
8     C  91.00
9     C   7.00
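An alternative sketch using the same dummy data: compute the per-group means once and map them onto the rows that need filling. Because the mapping Series can come from a different frame, the same pattern works for imputing a test set from training-set means:

# means computed on one frame can fill another via map + fillna
group_means = df.groupby("group")["value"].mean()
df["value"] = df["value"].fillna(df["group"].map(group_means))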
If I calculate the mean of a groupby object and one of the groups contains a NaN, the NaN is ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})
c.groupby('b').mean()
     a
b     
1  1.5
2  3.0
c.groupby('b').agg(np.mean)
     a
b     
1  1.5
2  3.0
I want to receive the following result:
     a
b     
1  1.5
2  NaN
I am aware that I can replace the NaNs beforehand, and that I could probably write my own aggregation function to return NaN as soon as a NaN is within the group. Such a function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
By the way, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include them by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
     a
b     
1  1.5
2  NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for exactly this task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
     a
b     
1  1.5
2  NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a': [1, np.inf, 2, 3], 'b': [1, 2, 1, 2]})
>>> c.groupby('b').mean()
          a
b          
1  1.500000
2       inf
There are three different methods for it:

Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))

Faster than apply, but slower than the default aggregation:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})

Fastest, but needs more code:
method3 = c.groupby('b').mean()
# find the groups (values of 'b') whose 'a' contains a NaN and blank them out
nan_groups = c.loc[c['a'].isna(), 'b'].to_list()
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
     b    c                   d
a                              
1  3.0  3.0  0.000000+2.000000j
2  3.0  0.0  0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()  # per-group count of non-NaN values, per column
siz = gb.size()   # per-group row count
# broadcast compare: True only where a column has no missing values in that group
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
     b    c   d
a              
1  3.0  3.0 NaN
2  NaN  NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
# cnt / cnt is 1 where a group has data and NaN where it is all-NaN
gb.sum() * (cnt / cnt)
Out[]:
     b    c                   d
a                              
1  3.0  3.0  0.000000+2.000000j
2  3.0  NaN                 NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
     b    c                   d
a                              
1  1.5  1.5  0.000000+2.000000j
2  3.0  NaN                 NaN
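The pieces above can be combined into a small helper (a sketch; strict_mean is a hypothetical name, and gb is a DataFrameGroupBy as in the setup):

def strict_mean(gb):
    # mean per group, but NaN for any group/column containing at least one NaN
    cnt = gb.count()
    siz = gb.size()
    mask = siz.values[:, None] == cnt.values  # True only where nothing is missing
    return (gb.sum() / cnt).where(mask)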
How do I compare values to the next or previous items in a loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so the dfoutput should look like the picture at the bottom.
This code doesn't work, because I can't compare an item to the next one.
Maybe there is another, simple way to do this without looping?
sumrep = 0
df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]  # It will be easier to assign repetitions in the output df - index will be equal to the number of repetitions
dfoutput = pd.DataFrame(0, index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['1','2'])

# example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 == 0:  # can't find the way to check the NEXT val1 (one row below) in column 1 :/
        if sumrep == 0:
            dfoutput.loc[1,1] = dfoutput.loc[1,1] + 1  # count only SINGLE occurrences of values and assign them to row 1 in dfoutput
        if sumrep > 0:
            dfoutput.loc[sumrep,1] = dfoutput.loc[sumrep,1] + 1  # count repeated occurrences greater than 1 and assign them to the proper row in dfoutput
        sumrep = 0
    elif val1 == 1 and df[val1+1] == 1:
        sumrep = sumrep + 1
Desired output table for column 1 - dfoutput:
I don't understand why there isn't a simple method to move around a DataFrame, like the OFFSET function in VBA/Excel :/
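As an aside on that last point: the pandas analogue of an OFFSET-style lookup is shift. A minimal sketch of "compare with the row below", which is what the commented code above is missing:

# shift(-1) aligns each value with the one from the next row
next_below = df['1'].shift(-1)
run_continues = (df['1'] == 1) & (next_below == 1)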
You can use the function defined here to perform fast run-length-encoding:
import numpy as np
def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna : bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial

def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
#       1    2
# 0   0.0  0.0
# 1   1.0  2.0
# 2   2.0  1.0
# 3   0.0  0.0
# 4   1.0  1.0
# 5   0.0  0.0
# 6   0.0  0.0
# 7   0.0  0.0
# 8   0.0  0.0
# 9   0.0  0.0
# 10  0.0  0.0
# 11  0.0  0.0
# 12  0.0  0.0
# 13  0.0  0.0
# 14  0.0  0.0
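A pure-pandas alternative to rlencode, based on the common shift/cumsum run-labelling idiom (a sketch for a single column; not part of the original answer):

s = df['1']
run_id = (s != s.shift()).cumsum()               # label each consecutive run
runs = s.groupby(run_id).agg(['first', 'size'])  # run value and run length
freq = runs.loc[runs['first'] == 1, 'size'].value_counts()
freq.reindex(df.index, fill_value=0)             # align with the 0-14 index used above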