Given an input dataframe and string:
df = pd.DataFrame({"A" : [4, 2, 10], "B" : [1, 4, 3]})
colour = "green" #or "red", "blue" etc.
I want to add a new column df["C"] conditional on the values in df["A"], df["B"] and colour so it looks like:
df = pd.DataFrame({"A" : [4, 2, 10], "B" : [1, 4, 3], "C" : [True, True, False]})
So far, I have a function that works for just the input values alone:
def check_passing(colour, A, B):
    if colour == "red":
        if B < 5:
            return True
        else:
            return False
    if colour == "blue":
        if B < 10:
            return True
        else:
            return False
    if colour == "green":
        if B < 5:
            if A < 5:
                return True
            else:
                return False
        else:
            return False
How would you go about using this function in df.assign() so that it calculates this for each row? Specifically, how do you pass each column to check_passing()?
df.assign() allows you to refer to the columns directly or in a lambda, but doesn't work within a function as you're passing in the entire column:
df = df.assign(C = check_passing(colour, df["A"], df["B"]))
Is there a way to avoid a long and incomprehensible lambda? Open to any other approaches or suggestions!
Applying a function like that can be inefficient, especially when dealing with dataframes with many rows. Here is a one-liner:
colour = "green" #or "red", "blue" etc.
df['C'] = ((colour == 'red') & df['B'].lt(5)) | ((colour == 'blue') & df['B'].lt(10)) | ((colour == 'green') & df['B'].lt(5) & df['A'].lt(5))
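If you do want to reuse check_passing as written, a row-wise DataFrame.apply also works; it is slower on large frames but keeps the readable branching. A minimal sketch, with the function condensed to the same logic:

```python
import pandas as pd

def check_passing(colour, A, B):
    # condensed version of the branching above
    if colour == "red":
        return B < 5
    if colour == "blue":
        return B < 10
    if colour == "green":
        return B < 5 and A < 5
    return False

df = pd.DataFrame({"A": [4, 2, 10], "B": [1, 4, 3]})
colour = "green"
# axis=1 passes each row as a Series, so the scalar function sees one A and one B at a time
df["C"] = df.apply(lambda row: check_passing(colour, row["A"], row["B"]), axis=1)
print(df["C"].tolist())  # [True, True, False]
```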
Is there a loop or function in Python that I can use to simplify this code, so that I can go through an entire list and return the result of each iteration? I need to use the row index both with iloc and as a function parameter.
I've tried while loops, but when I put the code in a function I get <function __main__.()> in return.
IndexList = [0,1,2,3,4,5,6,...,99]
IndexPosition = 0
row_index = IndexList[IndexPosition]
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
IndexPosition = 1
row_index = IndexList[IndexPosition]
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
IndexPosition = 2
row_index = IndexList[IndexPosition]
Result = dfColumn1.iloc[row_index] / dfFunction(row_index)
Result
#Output
etc
Ideally, I'd like it so the output is:
#Output of function from Index Position 1
#Output of function from Index Position 2
#Output of function from Index Position 3
#Output of function from Index Position 4
I think you can use a for(each) loop with the index using the built-in function enumerate(). Note that enumerate() yields (index, value) pairs, in that order:
for index, value in enumerate(IndexList):
    print(index, value)
You might want to take a few steps back from the columns and indexes you extracted. Pandas dataframes offer a lot of direct ways to iterate over the columns or cells themselves.
import pandas as pd
df = pd.DataFrame({'x': {0: 1, 1: 200, 2: 4, 3: 5, 4: 6},
                   'y': {0: 4, 1: 5, 2: 10, 3: 24, 4: 4},
                   'z': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})
results = []
for idx, x in enumerate(df.iloc[:, 1]):
    results.append(df.iloc[idx] / dfFunction(x))
for res in results:
    print(res)
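Putting that together with the names from the question, the repeated blocks collapse into one loop. This is a sketch: df_function here is a hypothetical stand-in for whatever per-index computation dfFunction actually performs, and df_column1 for your column.

```python
import pandas as pd

df_column1 = pd.Series([10.0, 20.0, 30.0, 40.0])

def df_function(row_index):
    # hypothetical placeholder; substitute your real function
    return row_index + 1

# one loop replaces the IndexPosition = 0, 1, 2, ... blocks
results = [df_column1.iloc[i] / df_function(i) for i in range(len(df_column1))]
for res in results:
    print(res)
```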
I have 2 dfs, one used as a mask for the other.
rndm = pd.DataFrame(np.random.randint(0,15,size=(100, 4)), columns=list('ABCD'))
rndm_mask = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
I want to use 2 conditions to change the values in rndm:
Is the value the mode of the column?
rndm_mask == 1
What works so far:
def colorBoolean(val):
    return f'background-color: {"red" if val else ""}'

rndm.style.apply(lambda _: rndm_mask.applymap(colorBoolean), axis=None)
# helper function to find the mode
def highlightMode(s):
    # Get mode of the column
    mode_ = s.mode().values
    # Apply style if the current value is in the mode_ array
    return ['background-color: yellow' if v in mode_ else '' for v in s]
Issue:
I'm unsure how to chain both functions so that values in rndm are highlighted only if they match both criteria (i.e. the value must be the most frequent value in its column and also be True in rndm_mask).
I appreciate any advice! Thanks
Try this: since your df_bool dataframe is a mask (identically indexed), you can refer to the df_bool object inside the style function, where x.name is the name of the column passed in via df.apply:
df = pd.DataFrame({'A': [5.5, 3, 0, 3, 1],
                   'B': [2, 1, 0.2, 4, 5],
                   'C': [3, 1, 3.5, 6, 0]})
df_bool = pd.DataFrame({'A': [0, 1, 0, 0, 1],
                        'B': [0, 0, 1, 0, 0],
                        'C': [1, 1, 1, 0, 0]})
# I want to use 2 conditions to change the values in df:
# Is the value the mode of the column?
# df_bool == 1
# What works so far:
def colorBoolean(x):
    return ['background-color: red' if v else '' for v in df_bool[x.name]]

# helper function to find the mode
def highlightMode(s):
    # Get mode of the column
    mode_ = s.mode().values
    # Apply style if the current value is in the mode_ array
    return ['background-color: yellow' if v in mode_ else '' for v in s]
df.style.apply(colorBoolean).apply(highlightMode)
Output:
Or the other way:
df.style.apply(highlightMode).apply(colorBoolean)
Output:
Update
Highlight where both are true:
def highlightMode(s):
    # Get mode of the column
    mode_ = s.mode().values
    # Apply style only where the value is the column mode AND flagged in df_bool
    return ['background-color: yellow' if (v in mode_) and b else '' for v, b in zip(s, df_bool[s.name])]
df.style.apply(highlightMode)
Output:
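The combined helper can be sanity-checked without rendering by calling it on a column directly (same shapes as the df and df_bool above, reduced to one column):

```python
import pandas as pd

df = pd.DataFrame({'A': [5.5, 3, 0, 3, 1]})
df_bool = pd.DataFrame({'A': [0, 1, 0, 0, 1]})

def highlightMode(s):
    mode_ = s.mode().values
    # style only where the value is the column mode AND the mask is 1
    return ['background-color: yellow' if (v in mode_) and b else '' for v, b in zip(s, df_bool[s.name])]

# only the second cell is both the mode (3) and flagged in df_bool
print(highlightMode(df['A']))
```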
I have 3 fields in a dataset, and I want to create querysets over the possible permutations of values for those fields. These fields are based on a numerical rating scale, so the queries are based on whether a value is matched in the given field.
I've already filtered the fields to ensure that the values in all three fields are at minimum a 7 (out of the possible integer values of 7, 8, 9 for those three fields).
Now I want to find the following potential comparisons as querysets (!7 means the value can be 8 or 9):
[7, 7, 7]
[7, !7, 7]
[7, 7, !7]
[!7, 7, 7]
[7, !7, !7]
[!7, 7, !7]
[!7, !7, 7]
I was able to do this with pandas by creating comparison columns with boolean evaluation to check the following permutations, (i.e. column 1 = 7, column 2 != 7 (so either 8 or 9), column 3 != 7 (so either 8 or 9)).
permutation_dict = {
    "all": (True, True, True),
    "one and three": (True, False, True),
    "one and two": (True, True, False),
    "two and three": (False, True, True),
    "one": (True, False, False),
    "two": (False, True, False),
    "three": (False, False, True),
}
This permutation dictionary is then looped over to create a comparison column which I could then get the count of "Trues" for a given permutation dictionary entry.
df = pd.DataFrame(queryset.values_list(
    "one",
    "two",
    "three",
))
bool_df = df.apply(lambda x: x == 7 if x.name in [0, 1, 2] else x)
v = permutation_dict["all"]
comparison_column = bool_df[
    (bool_df[0] == v[0])
    & (bool_df[1] == v[1])
    & (bool_df[2] == v[2])]
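That loop can be sketched end to end on a small frame (named columns here instead of the positional 0/1/2 labels, and a minimal hypothetical sample in place of the queryset):

```python
import pandas as pd

# hypothetical sample: three rating fields, all values at least 7
df = pd.DataFrame({"one": [7, 7, 7], "two": [7, 8, 9], "three": [7, 7, 7]})

permutation_dict = {
    "all": (True, True, True),
    "one and three": (True, False, True),
}

counts = {}
for name, expected in permutation_dict.items():
    # a row matches when each field's (value == 7) test agrees with the expected boolean
    mask = pd.Series(True, index=df.index)
    for col, want_seven in zip(["one", "two", "three"], expected):
        mask &= (df[col] == 7) == want_seven
    counts[name] = int(mask.sum())
print(counts)  # {'all': 1, 'one and three': 2}
```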
Is there a way to do a similar operation with Django filters? While I get that I could chain queryset.filter(one=7).exclude(two=7).exclude(three=7) to replicate permutation_dict["one"], it seems like I would have to hardcode the filter expression (filter vs. exclude) for each permutation to get the same result.
You can use Q:
A Q object (django.db.models.Q) is an object used to encapsulate a collection of keyword arguments. These keyword arguments are specified as in “Field lookups” above.
# create initial data for the sample
M1.objects.create(i1=7, i2=7, i3=7)
M1.objects.create(i1=7, i2=8, i3=7)
M1.objects.create(i1=7, i2=9, i3=7)

# importing Q
from django.db.models import Q

# encapsulating each field lookup
q1 = Q(i1=7)
q2 = Q(i2=7)
q3 = Q(i3=7)

# permutation dict of complex lookups (~ negates a Q object)
permutation_dict = {
    "all": q1 & q2 & q3,
    "one and three": q1 & ~q2 & q3,
    # ...
}

# counting
M1.objects.filter(permutation_dict["all"]).count()
# returns 1
M1.objects.filter(permutation_dict["one and three"]).count()
# returns 2
I don't know if this is exactly what you are looking for, but I'm pretty sure this example will help you find the way.
I want to fill missing values like this:
data = pd.read_csv("E:\\SPEED.csv")
Data - DataFrame
Case - 1
If fclass is "motorway", "motorway_link", "trunk" or "trunk_link", I want to replace the NaN values with 110.
Case - 2
If fclass is "primary", "primary_link", "secondary" or "secondary_link", I want to replace the NaN values with 70.
Case - 3
If "fclass" is any other value, I want to replace the NaN values with 40.
I would be grateful for any help.
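The three cases map cleanly onto numpy.select: build one boolean condition per road-class group, pick the matching fill value, and use it only where the cell is NaN. A sketch on a small stand-in frame ("speed" is a hypothetical name for the value column; substitute your actual column):

```python
import numpy as np
import pandas as pd

# small stand-in for the CSV
data = pd.DataFrame({
    "fclass": ["motorway", "primary", "residential", "trunk_link"],
    "speed": [np.nan, np.nan, np.nan, 100.0],
})

conditions = [
    data["fclass"].isin(["motorway", "motorway_link", "trunk", "trunk_link"]),
    data["fclass"].isin(["primary", "primary_link", "secondary", "secondary_link"]),
]
# 110 or 70 by road class, 40 otherwise
fill = pd.Series(np.select(conditions, [110, 70], default=40), index=data.index)
# fillna only touches the NaN cells, so existing values survive
data["speed"] = data["speed"].fillna(fill)
print(data["speed"].tolist())  # [110.0, 70.0, 40.0, 100.0]
```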
Two ways in pandas:
df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [1, 4, 9, np.nan],
        "C": [1, 2, 3, 5],
        "D": list("abcd"),
    }
)
fillna lets you fill NAs (or NaNs) with a fixed value:
df['B'].fillna(12)
# [1, 4, 9, 12]
interpolate fills NaNs by interpolation, linear by default (many of the other methods require SciPy):
df.interpolate()['A']
# [1, 2, 3, 4]
Thank you all for your answers. However, as there are 6812 rows and 16 columns (containing nan values) in the data, it seems that different solutions are required.
You can try this
import pandas as pd
import math

def valuesMapper(data, valuesDict, columns_to_update):
    for col in columns_to_update:
        # fill NaN cells from the row's fclass (default 40); the lookup must use
        # the fclass value, not the NaN cell itself
        data[col] = data.apply(
            lambda row: valuesDict.get(row['fclass'], 40) if math.isnan(row[col]) else row[col],
            axis=1,
        )
    return data

data = pd.read_csv("E:\\SPEED.csv")
valuesDict = {"motorway": 110, "motorway_link": 110, "trunk": 110, "trunk_link": 110,
              "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
columns_to_update = ['AGU_PZR_07_10']  # list of columns to be updated; you can build it from data.columns
print(valuesMapper(data, valuesDict, columns_to_update))
With the below example:
data = pandas.DataFrame({
'flclass': ['a', 'b', 'c', 'a'],
'AGU': [float('nan'), float('nan'), float('nan'), 9]
})
You can update it using numpy conditionals, iterating over your columns starting from the 2nd ([1:]); adjust the slice (e.g. [4:]) to whichever columns hold values in your data:
for column in data.columns[1:]:
    data[column] = np.where((data['flclass'] == 'b') & (data[column].isna()), 110, data[column])
Or with pandas apply:
import numpy as np

data['AGU'] = data.apply(
    lambda row: 110 if np.isnan(row['AGU']) and row['flclass'] in ("b", "a") else row['AGU'],
    axis=1,
)
where you can replace ("b","a") with eg ("motorway", "motorway_link", "trunk", "trunk_link")
I have a flask app where I am getting back data and transformed into pandas Dataframe.
if request.method == 'PUT':
    content = request.get_json(silent=True)
    df = pd.DataFrame.from_dict(content)
    for index, row in df.iterrows():
        if row["label"] == True:
            row['A'] = row['B'] / row['C']
        elif row["label"] == False:
            row['A'] = row["B"]
        if row['D'] == 0:
            row['C'] = 0
        else:
            ...
What I am trying to do here is simple arithmetic: addition, subtraction & division.
I used iterrows() mainly because I needed multiple values from each row to perform the calculations; df['..'].item() didn't work in my use case.
Addition and subtraction work fine, but division seems to slip somehow and always returns values like 0, -1, 1.
Example calculation
row['A'] = row['B'] / row['C']
Most of the time the values of row['B'] is lesser than row['C']. Example values
row['A'] = 1232455 / 26719856
The only calculation involved in the app are addition, subtraction & division.
You can try this (here is an example):
import pandas as pd
import numpy as np
data = {'label': [True, False, True, True, False],
        'A': [2012, 2012, 2013, 2014, 2014],
        'B': [4, 24, 31, 21, 3],
        'C': [25, 94, 57, 62, 70],
        'D': [3645, 0, 27, 24, 96]}
df = pd.DataFrame(data)
You can apply your changes directly on your main Dataframe without being required to iterate over each row each time like this :
# select only rows with label == True and apply the division function
df.loc[df.label == True, 'A'] = df['B']/df['C']
df.loc[df.label == False, 'A'] = df['B']
df.loc[np.logical_and(df.label == False, df.D == 0), 'C'] = 0
...
You can select the rows you want to change each time and apply the changes directly, just as I did above.
Another point: after applying the division, the integer columns are promoted to float64. You can also convert explicitly with Series.astype('float64').
For row['A'] = 1232455 / 26719856 you will then get 0.046125 and not only the integer part 0.
That should save you from getting zeros every time you do a division.
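A np.where variant sidesteps the dtype question entirely, since the result column is created as float in one step (a sketch using the numbers from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": [True, False],
                   "B": [1232455, 5],
                   "C": [26719856, 10]})
# division where label is True, a plain copy of B otherwise;
# np.where upcasts the whole result to float64
df["A"] = np.where(df["label"], df["B"] / df["C"], df["B"])
print(df["A"].tolist())
```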