Loop and Accumulate Sum from Pandas Column Made of Lists - python

Currently, my Pandas data frame looks like the following:
Row_X
["Medium", "High", "Low"]
["Medium"]
My intention is to iterate through the list in each row such that:
summation = 0
for value in df["Row_X"]:
    if "High" in value:
        summation = summation + 10
    elif "Medium" in value:
        summation = summation + 5
    else:
        summation = summation + 0
Finally, I wish to apply this to each row and create a new column that looks like the following:
Row_Y
15
10
My assumption is that either np.select() or apply() can play into this, but I have thus far encountered errors implementing either.

We can do:
mapper = {'Medium': 5, 'High': 10}
df['Row_Y'] = [sum(mapper[word] for word in l if word in mapper)
               for l in df['Row_X']]
If your pandas version is 0.25.0 or later, we can use explode:
df['Row_Y'] = df['Row_X'].explode().map(mapper).sum(level=0)
print(df)
Row_X Row_Y
0 [Medium, High, Low] 15
1 [Medium] 5
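Note that Series.sum(level=0) was deprecated and then removed in pandas 2.0. On recent versions the equivalent (assuming the same mapper dict as above) is a groupby on the index:
df['Row_Y'] = df['Row_X'].explode().map(mapper).groupby(level=0).sum()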

For something on the cleaner side, convert each list to a Series and use map directly:
mapp = {'Medium' : 5, 'High' : 10}
df['Row_Y'] = df['Row_X'].apply(lambda x: pd.Series(x).map(mapp).sum())
df
Row_X Row_Y
0 [Medium, High, Low] 15.0
1 [Medium] 5.0

Map your function to the series
import pandas as pd

def function(x):
    summation = 0
    for i in x:
        if "High" in i:
            summation += 10
        elif "Medium" in i:
            summation += 5
        else:
            summation += 0
    return summation

df = pd.DataFrame({'raw_x': [['Medium', 'High', 'Low'], ['Medium']]})
df['row_y'] = df['raw_x'].map(function)
You can do it in a shorter format with
mapping = {'High': 10, 'Medium': 5, 'Low': 0}
df['row_y'] = df['raw_x'].map(lambda x: sum(mapping[i] if i in mapping else 0 for i in x))
print(df)
raw_x row_y
0 [Medium, High, Low] 15
1 [Medium] 5

This solution should work:
vals = {"High": 10, "Medium": 5, "Low": 0}
df['Row_Y'] = df.apply(lambda row: sum(vals[i] for i in row['Row_X']), axis=1)
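If Row_X could ever contain a label missing from vals, the bare dict lookup would raise a KeyError. A safer variant of the same apply (just a sketch, using dict.get to default unknown labels to 0):
df['Row_Y'] = df.apply(lambda row: sum(vals.get(i, 0) for i in row['Row_X']), axis=1)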

Related

How to pass the whole dataframe and the index of the row being operated upon to the apply() method

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id  scores  cosine
1   100     0.8
2   75      0.7
3   50      0.4
4   25      0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd

score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
This way the whole correlation matrix isn't computed on every call. (The walrus operator := requires Python 3.8+.)
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column index containing what was previously your index. When using apply, access it via:
correlation_df.apply(lambda r: r["index"], axis=1)
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
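Putting the pieces together, here is a minimal runnable sketch of the leave-one-out correlation using drop(x.name), assuming the correlation_df built above:
import pandas as pd

correlation_df = pd.DataFrame({
    "score": [100, 75, 50, 25],
    "cosine": [.8, 0.7, 0.4, .05],
})

# drop the current row by its label, then correlate what remains
correlation_df["diff_correlations"] = correlation_df.apply(
    lambda x: correlation_df.drop(x.name).corr().iat[0, 1], axis=1
)
print(correlation_df)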

Iterate through df rows and sum values of two columns separately until condition is met on one of those columns

I am definitely still learning python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that will sum the row values of each of these two columns independently until a threshold sum of A exceeds some value; for this example let's say 10. So far I have been trying to use iterrows() and can segment based on whether A >= 10, but can't seem to solve summation of rows until the threshold is met. The resultant df must be exhaustive even if the final A values do not meet the conditional threshold - see final row of desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable I created, t, is essentially checking the cumulative sums to see if they are > n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe for any given row (j and u are just there in parallel, doing the same thing for column B).
There are a few conditions so some elif statements, and there will be different behavior for the last row the way I have set it up, so I had to have some separate logic for that with the last if -- otherwise the last value wasn't getting appended:
import pandas as pd

df1 = pd.DataFrame(data=[[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],
                   columns=['A','B'])
df1

a, b = [], []
t, u, count = 0, 0, 0
n = 10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)

df2 = pd.DataFrame({'A': a, 'B': b})
df2
Here's a shorter one that works:
import pandas as pd

df1 = pd.DataFrame(data=[[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],
                   columns=['A','B'])
df2 = pd.DataFrame()
index = 0
while index < df1.size / 2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df2 = pd.concat([df2, temp_df], ignore_index=True)
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size / 2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        if a_sum >= 10:
            temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
        else:
            a = df1.iloc[index-1]['A']
            b = df1.iloc[index-1]['B']
            temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In Pandas, use iloc to access each row by index. Make sure you don't go outside the DataFrame by checking the size. df.size returns the number of elements (rows times columns), which is why I divided the size by the number of columns to get the actual number of rows.
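For what it's worth, the same threshold logic can also be expressed by first building a group label per row with a small loop and then letting groupby do the summing. A sketch, assuming the df1 and threshold n = 10 from above:
labels, running, group = [], 0, 0
for a in df1['A']:
    if running >= 10:   # previous group reached the threshold, start a new one
        group += 1
        running = 0
    running += a
    labels.append(group)

df2 = df1.groupby(labels, as_index=False).sum()
print(df2)
This reproduces the desired output, including the final group that never reaches the threshold.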

How can I fill NaN values in a dataframe with the average of the values above it?

I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, as such:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA]
})

for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
It iterates on each NaN and fills it with the mean of the previous values, including those filled NaNs.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution (its runtime complexity is O(N^2)) when there are many NaNs, and we should also avoid Python iterations, since they are slow compared to native pandas / numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[last_idx:idx, "A"]
    # label-based slices in pandas include both endpoints, so drop idx itself
    prev_values = prev_values[:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script passes over the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine but it can be optimized with incremental average formulas, from math.stackexchange.com
Here is an adaptation of that other question (not the exact formula, just the concept).
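For reference, the standard incremental-mean update is mean_(n+1) = mean_n + (x_(n+1) - mean_n) / (n + 1); the code below keeps a running cumsum instead and divides by the count, which amounts to the same thing.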
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # divide by number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi

df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage with this code is not having to read all items up to the current index on each iteration to get the mean.
This option still uses a python loop--which is not the best choice with pandas--but there seems to be no way around it for this use case (hopefully someone will get inspired and find such a method without a loop).
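One tempting shortcut, df['A'].fillna(df['A'].expanding().mean()), does not reproduce this logic: expanding().mean() skips NaNs entirely, whereas here the filled values count toward later denominators. For instance, row 8 should be (1+2+3+4+5+3+3+11)/8 = 4, not (1+2+3+4+5+11)/6 ≈ 4.33.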
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using the timeit module with 3 repetitions on random samples with 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt

def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # divide by number of items up to previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[last_idx:idx, "A"]
        # label-based slices include both endpoints, so drop idx itself
        prev_values = prev_values[:-1]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df

def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
    return df

def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})

r = 3
p = 0.4  # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))
print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )

timings = pd.DataFrame(timings).T
print(timings)

timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')

Python sum it twice for groupby

I'm struggling with a series sum after having already grouped the dataframe, and I was hoping that someone could please help me with an idea.
In the example below I need the sum per each "Material".
Material "ABC" should give me 2, and all the others, having only one sign operation each, would keep their value.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Material": ["M-12", "H4-LAMPE", "M-12", "H4-LAMPE",
                 "ABC", "H4-LAMPE", "ABC", "ABC"],
    "Quantity": [6, 1, 3, 5, 1, 1, 10, 9],
    "TYPE": ["+", "-", "+", "-", "+", "-", "+", "-"]})

df.groupby(['Material', "Quantity"], as_index=False).count()

listX = []
for item in df["TYPE"]:
    if item == "+":
        listX.append(1)
    elif item == "-":
        listX.append(-1)
    else:
        pass

df["Sign"] = listX
df["MovementsQty"] = df["Quantity"] * df["Sign"]

#df = df.groupby(["Material", "TYPE", "Quantity1"]).sum()
df1 = df.groupby(["Material", "TYPE"]).sum()
df1.drop(columns=["Quantity", "Sign"], inplace=True)
print(df1)
The result is a sum per Material and TYPE pair; the desired result is a single net sum per Material.
I tried to sum it again, to consider it differently but I was not successful so far and I think I need some help.
Thank you very much for your help
You're on the right track. I've tried to improve your code. Just use "Type" to determine and assign the sign using np.where, perform groupby and sum, and then re-compute the "Type" column based on the result.
v = (df.assign(Quantity=np.where(df.TYPE == '+', df.Quantity, -df.Quantity))
       .groupby('Material', as_index=False)[['Quantity']]
       .sum())
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print (v)
Material Type Quantity
0 ABC + 2
1 H4-LAMPE - -7
2 M-12 + 9
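As a side note, the sign flip inside assign could equally be written with Series.where, which keeps values where the condition holds and substitutes the rest (a small equivalent sketch):
signed = df.Quantity.where(df.TYPE == '+', -df.Quantity)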
Alternatively, you can do this with two groupby calls:
i = df.query('TYPE == "+"').groupby('Material').Quantity.sum()
j = df.query('TYPE == "-"').groupby('Material').Quantity.sum()
# Find the union of the indexes.
idx = i.index.union(j.index)
# Reindex and subtract.
v = i.reindex(idx).fillna(0).sub(j.reindex(idx).fillna(0)).reset_index()
# Insert the Type column back into the result.
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print(v)
Material Type Quantity
0 ABC + 2.0
1 H4-LAMPE - -7.0
2 M-12 + 9.0
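Note the float dtype in Quantity: reindex introduces NaN for labels missing on one side and fillna leaves the column as float. If integers are wanted, casting back at the end is enough:
v['Quantity'] = v['Quantity'].astype(int)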
Here is another take (fairly similar to coldspeed though).
#Correct quantity with negative sign (-) according to TYPE
df.loc[df['TYPE'] == '-', 'Quantity'] *= -1
#Reconstruct df as sum of quantity to remove dups
df = df.groupby('Material')['Quantity'].sum().reset_index()
df['TYPE'] = np.where(df['Quantity'] < 0, '-', '+')
print(df)
Returns:
Material Quantity TYPE
0 ABC 2 +
1 H4-LAMPE -7 -
2 M-12 9 +
map and numpy.sign
Just sum up Quantity * TYPE and figure out the sign afterwards.
d = {'+': 1, '-': -1}
# invert d to map signs back to symbols: {1: '+', -1: '-'}
r = dict(map(reversed, d.items())).get
q = df.Quantity
m = df.Material
t = df.TYPE
# signed quantities, indexed by Material, summed per index label
s = pd.Series((q * t.map(d)).values, m, name='MovementsQty').sum(level=0)
s.reset_index().assign(TYPE=lambda x: [*map(r, np.sign(x.MovementsQty))])
Material MovementsQty TYPE
0 M-12 9 +
1 H4-LAMPE -7 -
2 ABC 2 +
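As noted above, Series.sum(level=0) was removed in pandas 2.0; on recent versions the same per-label sum is a groupby on the index (a sketch with the same names):
s = pd.Series((q * t.map(d)).values, m, name='MovementsQty').groupby(level=0).sum()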

Assign columns' value from other columns in Pandas dataframe

How do I assign columns in my dataframe to be equal to another column if/where a condition is met?
Update
The problem
I need to assign many columns values (and sometimes a value from another column in that row) when the condition is met.
The condition is not the problem.
I need an efficient way to do this:
df.loc[some condition it doesn't matter,
       ['a','b','c','d','e','f','g','x','y']] = df['z'],1,3,4,5,6,7,8,df['p']
Simplified example data
d = {'var': pd.Series([10, 61]),
     'c': pd.Series([100, 0]),
     'z': pd.Series(['x', 'x']),
     'y': pd.Series([None, None]),
     'x': pd.Series([None, None])}
df = pd.DataFrame(d)
Condition: var is not missing and its first digit is less than 5.
Result: make df.x = df.z and df.y = 1.
Here is pseudocode that doesn't work, but it is what I would want:
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
       ['x','y']] = df['z'],1
but i get
ValueError: cannot set using a list-like indexer with a different length than the value
ideal output
c var x z y
0 100 10 x x 1
1 0 61 None x None
The code below works, but is too inefficient because i need to assign values to multiple columns.
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
       ['x']] = df['z']
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
       ['y']] = 1
You can work row-wise:
def f(row):
    if row['var'] is not None and int(str(row['var'])[0]) < 5:
        row[['x', 'y']] = row['z'], 1
    return row
>>> df.apply(f, axis=1)
c var x y z
0 100 10 x 1 x
1 0 61 None NaN x
To overwrite the original df:
df = df.apply(f, axis=1)
This is one way of doing it:
import pandas as pd
import numpy as np

d = {'var': pd.Series([1, 6]),
     'c': pd.Series([100, 0]),
     'z': pd.Series(['x', 'x']),
     'y': pd.Series([None, None]),
     'x': pd.Series([None, None])}
df = pd.DataFrame(d)

# Condition 1: var is not missing
cond1 = ~df['var'].apply(np.isnan)
# Condition 2: first digit is less than 5
cond2 = df['var'].apply(lambda x: int(str(x)[0])) < 5
mask = cond1 & cond2

df.loc[mask, 'x'] = df.loc[mask, 'z']
df.loc[mask, 'y'] = 1
print(df)
Output:
c var x y z
0 100 1 x 1 x
1 0 6 None None x
As you can see, the Boolean mask has to be applied on both sides of the assignment, and you need to broadcast the value 1 on the y column. It is probably cleaner to split the steps into multiple lines.
Question updated, edit: More generally, since some assignments take their values from other columns and some just broadcast along the column, you can do it in two steps. Note the .to_numpy(): assigning a DataFrame to a .loc selection aligns on column labels, so the labels must be stripped for z to fill a and p to fill y:
df.loc[conds, ['a','y']] = df.loc[conds, ['z','p']].to_numpy()
df.loc[conds, ['b','c','d','e','f','g','x']] = [1,3,4,5,6,7,8]
You may profile and see if this is efficient enough for your use case.
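For the simplified example data above, a concrete runnable sketch of the two-step assignment (covering just the two target columns x and y from the question):
import pandas as pd

d = {'var': pd.Series([10, 61]),
     'c': pd.Series([100, 0]),
     'z': pd.Series(['x', 'x']),
     'y': pd.Series([None, None]),
     'x': pd.Series([None, None])}
df = pd.DataFrame(d)

# var is not missing and its first digit is less than 5
cond = df['var'].notna() & (df['var'].astype(str).str[0].astype(int) < 5)
df.loc[cond, 'x'] = df.loc[cond, 'z']  # value taken from another column
df.loc[cond, 'y'] = 1                  # broadcast scalar
print(df)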
