Conditional w/Pandas Dataframe - python

Need a help with conditionals for pandas dataframe. Apologies in advance for the basic question or if it's covered elsewhere.
Here's the example dataframe:
employee sales revenue salary
12345 20 10000 100000
I have a few conditions based on data which will result in salary changing.
scenarios:
if sales >10 and revenue > $5,000, increase salary by 20%
if sales <5 and revenue > $5,000, increase salary by 10%
otherwise, do nothing.
variables:
high_sales = 10
low_sales = 5
high_revenue = 5000
big_increase = 1.2
small_increase = 1.1
I know this requires some nesting but it's not clear to me how to do it.
I want the outcome to be a dataframe with only the salary column adjusted.
Here's the code:
df['salary'] = np.where((df['sales']>=high_sales & df['revenue']
>=high_revenue), df['salary'] * big_increase, (df['sales']<=low_sales &
df['revenue'] >=high_revenue), df['salary'] * small_increase, df['sales'])
Is this right?

With multiple conditions, it's nicer to use np.select rather than np.where:
conds = [(df.sales > 10) & (df.revenue > 5000),
(df.sales < 5) & (df.revenue > 5000)]
choices = [df.salary * 1.2, df.salary * 1.1]
df['salary'] = np.select(conds, choices, default = df.revenue)

Related

A way to make the [apply + lambda + loc] efficient

The dataframe(contains data on the 2016 elections), loaded in pandas from a .csv has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea would be to calculate the first X counties with the highest percentage of votes in favor of candidate X (removing Totals)
For example suppose we want 100 counties, and the candidate is Trump, the operation to be carried out is: 100 * sum of votes for Trump / total votes
I have implemented the following code, getting correct results:
In [3]: (df.groupby(by="county")
.apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
& (~x.county == "Total"), "votes"].sum() / x.votes.sum())
.nlargest(100)
.reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time i realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try to amend your codes to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages, you can slightly change the codes below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
you can try:
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df.loc[df['candidate'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don`t have the data but it doesn't use any apply which could increase the performance
for the rename of the columns you can use .rename(columns={'votes':'percentage'}) at the end

How can you increase the speed of an algorithm that computes a usage streak?

I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID
Time
Term
QuestionID
Answer
1
2019-12-28 18:25:15
Winter19
345
a
2
2019-12-29 20:15:13
Winter19
734
b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID
Term
HighestStreak
1
Winter19
7
2
Winter19
10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
final_streak = 1
for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
temp_streak = 1
while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
i += pd.DateOffset(days=1)
temp_streak += 1
if temp_streak > final_streak:
final_streak = temp_streak
appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I filtered the data to students who answered at least one answer first and set the streak to 0 for the rest (as the streak for 0 answers is 0 in every case):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
for userid, term in zip(appData.userid, appData.term):
activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
final_streak = 1
for day in activeDays:
temp_streak = 1
while day + pd.DateOffset(days=1) in activeDays:
day += pd.DateOffset(days=1)
temp_streak += 1
if temp_streak > final_streak:
final_streak = temp_streak
appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
my attempt, which borrows some key ideas from the connected components problem; a fairly early problem when looking at graphs
first I create a random DataFrame with some user id's and some dates.
import datetime
import random
import pandas
import numpy
#generate basic dataframe of users and answer dates
def sample_data_frame():
users = ['A' + str(x) for x in range(10000)] #generate user id
date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364) , datetime.date.today()),
name='date')
users = pandas.Series(users, name='user')
df = pandas.merge(date_range, users, how='cross')
removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove random quarter of entries
df.drop(removals, inplace=True)
return df
def sample_data_frame_v2(): #pandas version <1.2
users = ['A' + str(x) for x in range(10000)] #generate user id
date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364) , datetime.date.today()), columns = ['date'])
users = pandas.DataFrame(users, columns = ['user'])
date_range['key'] = 1
users['key'] = 1
df = users.merge(date_range, on='key')
df.drop(labels = 'key', axis = 1)
removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove random quarter of entries
df.drop(removals, inplace=True)
return df
put your DataFrame in sorted order, so that the next row is next answer day and then by user
create two new columns from the row below containing the userid and the date of the row below
if the user of row below is the same as the current row and the current date + 1 day is the same as the row below set the column result to false numerically known as 0, otherwise if it's a new streak set to True, which can be represented numerically as 1.
cumulatively sum the results which will group your streaks
finally count how many entries exist per group and find the max for each user
for 10k users over 364 days worth of answers my running time is about a 1 second
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak

Filtering with Pandas

I am playing with the following dataset (which is basically a dataset representing the number of gunshot deaths in the US) and I am trying to prove that "Around two-thirds of homicide victims who are males in the age-group of 15--34 are black".
Here's my attempt :
data = pd.read_csv("./guns-data-master/full_data.csv")
homicides = data[data['intent'] == 'Homicide']
male_homicides = homicides[homicides['sex'] == 'M']
less_thirty_four = male_homicides[male_homicides['age'] <= 34.0]
within_range = less_thirty_four[less_thirty_four['age'] >= 15.0]
within_range.race.value_counts()
which basically gives me enough information to prove what I want. However, I am sure that there must be an easier and more efficient way to filter out all the homicide victims which are males and between 15 and 34 years old.
What can I do to make this filtering process more efficient?
In addition to what #hypnos has mentioned, an alternative way to do it (with perhaps better readability) is to use the query method.
url = "https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv"
df = pd.read_csv(url, index_col=[0])
df.query("age >= 25 and age <= 34 and intent == 'Homicide' and sex == 'M'") \
.race \
.value_counts()
Black 5901
White 1568
Hispanic 1564
Asian/Pacific Islander 122
Native American/Native Alaskan 90
Try this:
data = pd.read_csv("./guns-data-master/full_data.csv")
homicides = data[(data['intent'] == 'Homicide') & (data['sex'] == 'M') & (data['age'] <= 34.0) & (data['age'] >= 15.0) ]
homicides.race.value_counts()

Pandas filter rows based on multiple conditions

I have some values in the risk column that are neither, Small, Medium or High. I want to delete the rows with the value not being Small, Medium and High. I tried the following:
df = df[(df.risk == "Small") | (df.risk == "Medium") | (df.risk == "High")]
But this returns an empty DataFrame. How can I filter them correctly?
I think you want:
df = df[(df.risk.isin(["Small","Medium","High"]))]
Example:
In [5]:
import pandas as pd
df = pd.DataFrame({'risk':['Small','High','Medium','Negligible', 'Very High']})
df
Out[5]:
risk
0 Small
1 High
2 Medium
3 Negligible
4 Very High
[5 rows x 1 columns]
In [6]:
df[df.risk.isin(['Small','Medium','High'])]
Out[6]:
risk
0 Small
1 High
2 Medium
[3 rows x 1 columns]
Another nice and readable approach is the following:
small_risk = df["risk"] == "Small"
medium_risk = df["risk"] == "Medium"
high_risk = df["risk"] == "High"
Then you can use it like this:
df[small_risk | medium_risk | high_risk]
or
df[small_risk & medium_risk]
You could also use query:
df.query('risk in ["Small","Medium","High"]')
You can refer to variables in the environment by prefixing them with #. For example:
lst = ["Small","Medium","High"]
df.query("risk in #lst")
If the column name is multiple words, e.g. "risk factor", you can refer to it by surrounding it with backticks ` `:
df.query('`risk factor` in #lst')
query method comes in handy if you need to chain multiple conditions. For example, the outcome of the following filter:
df[df['risk factor'].isin(lst) & (df['value']**2 > 2) & (df['value']**2 < 5)]
can be derived using the following expression:
df.query('`risk factor` in #lst and 2 < value**2 < 5')

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:
k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. this is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way. perhaps, there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators. The & operator is actually an overloaded bitwise operator which has the same precedence as arithmetic operators which in turn have a higher precedence than comparison operators.
In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset modulo the select argument:
With query() you'd do it like this:
df[['Time', 'Product']].query('Product == p_id and Month < mn and Year == yr')
Here's a simple example:
In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested will even be able to take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == p_id and start_time <= Time < end_time')
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time> start_time) & (df.Time < end_time)][['Time','Product']]
No need for data.loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']
And let's say you want to include products made before 2014. You could write,
df[df['Year'] < 2014]
To return all the rows where this is the case. You can add different conditions.
df[df['Year'] < 2014][df['Color' == 'Red']
Then just choose the columns you want as directed above. For instance, the product color and key for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
Regarding some points mentioned in previous answers, and to improve readability:
No need for data.loc or query, but I do think it is a bit long.
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.
I like to write such expressions as follows - less brackets, faster to type, easier to read. Closer to R, too.
q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
# c is just a convenience
c = lambda v: v.split(',')
Creating an Empty Dataframe with known Column Name:
Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)
Creating a dataframe from csv:
df = pd.DataFrame('...../file_name.csv')
Creating a dynamic filter to subset a dtaframe:
i = 12
df[df['ActivitiID'] <= i]
Creating a dynamic filter to subset required columns of dtaframe
df[df['ActivityID'] == i][['TransactionID','ActivityID']]

Categories

Resources