How to create a function for a dataframe with pandas - python

I have this data frame of clients' purchases, and I would like to create a function that gives me the total purchases for a given input of month and year.
I have a dataframe (df) with lots of columns, but I'm going to use only 3 ("year", "month", "value").
This is what I'm trying, but it's not working:
def total_purchases():
    y = input('Which year do you want to consult?')
    m = int(input('Which month do you want to consult?'))
    sum = []
    if df[df['year'] == y] & df[df['month'] == m]:
        for i in df:
            sum = sum + df[df['value']]
    return sum

You're close, but you need to ditch the if statement and the for loop.
Additionally, when combining multiple logical conditions in pandas you need to use parentheses to separate the conditions.
def total_purchases(df):
    y = input('Which year do you want to consult? ')
    m = int(input('Which month do you want to consult? '))
    return df[(df['year'].eq(y)) & (df['month'].eq(m))]['value'].sum()
Setup
df_p = pd.DataFrame({'year': ['2011', '2011', '2012', '2013'],
                     'month': [1, 2, 1, 2],
                     'value': [200, 500, 700, 900]})
Test
total_purchases(df_p)
Which year do you want to consult? 2011
Which month do you want to consult? 2
500
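If the interactivity isn't essential, a variant that takes the year and month as parameters instead of calling input() is easier to test; this is just a sketch of the same filter-and-sum idea:

```python
import pandas as pd

def total_purchases(df, year, month):
    # Each condition sits in its own parentheses before combining with &
    return df[(df['year'].eq(year)) & (df['month'].eq(month))]['value'].sum()

df_p = pd.DataFrame({'year': ['2011', '2011', '2012', '2013'],
                     'month': [1, 2, 1, 2],
                     'value': [200, 500, 700, 900]})

print(total_purchases(df_p, '2011', 2))  # 500
```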


How can you increase the speed of an algorithm that computes a usage streak?

I have the following problem: I have data (a table called 'answers') from a quiz application, including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 rows. Is there a structural mistake in my code that makes it so slow, or do processes like this simply take that long?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following changes:
I first filtered the data to students who answered at least one question, and set the streak to 0 for the rest (since the streak for 0 answers is always 0):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore, I moved the filtered list out of the loop so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt borrows some key ideas from the connected-components problem, a fairly early problem when studying graphs.
First I create a random DataFrame with some user ids and some dates.
import datetime
import random
import pandas
import numpy

# generate a basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
def sample_data_frame_v2():  # pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)  # drop returns a copy, so reassign it
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order by user and then by date, so that each row is followed by that user's next answer day.
Create two new columns holding the user id and the date of the previous row (via shift).
If the previous row belongs to the same user and its date is exactly one day earlier, the current row continues a streak, so set the result column to False (numerically 0); otherwise it starts a new streak, so set it to True (numerically 1).
Cumulatively sum the results, which gives each streak its own group number.
Finally, count how many entries exist per group and find the max count for each user.
For 10k users with 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop=True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = df.groupby(by=['user', 'group']).count()['result'].groupby(level='user').max()  # Series.max(level=...) is deprecated in newer pandas
summary.sort_values(ascending=False)  # user with the highest streak first
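To make the shift/cumsum trick concrete, here is a minimal standalone sketch on hand-made data (user A answers three consecutive days, user B has a gap), so the expected streaks are known in advance:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-01', '2021-01-03']),
})
df = df.sort_values(['user', 'date']).reset_index(drop=True)

# A row starts a new streak unless the previous row is the same user one day earlier
new_streak = ~((df['date'].shift() == df['date'] - pd.Timedelta(days=1))
               & (df['user'].shift() == df['user']))
df['group'] = new_streak.cumsum()

# Longest streak per user: size of the biggest streak group
summary = df.groupby(['user', 'group']).size().groupby(level='user').max()
print(summary['A'], summary['B'])  # 3 1
```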

Python: formula to calculate cumulative and monthly value of a list

I have two requests for you:
First one:
I have a monthly income variable given by the following:
total_income = {'Revenues': list(100 for m in range(12))}
I want to extract (based on today's date) the current month's income, for example with:
now = datetime.datetime.now()
this_month = now.month
I want to extract from total_income the value for this month. For example, if this_month=6 I want the income of month 6.
Second one:
Also based on the this_month variable, I want to get the cumulative sum of all months up to it. For example, if this_month=6 I want the sum of all income from month 1 through 6.
Maybe something like this:
import datetime
total_income = {'Revenues': [100 for m in range(12)]}
current_month = datetime.date.today().month
current_revenue = total_income['Revenues'][current_month - 1]
cumulative_revenue = sum(total_income['Revenues'][:current_month])
You can implement this using the following function:
import datetime

# function definition
def find_income(total_income):
    this_month = datetime.datetime.now().month
    total_income_list = total_income["Revenues"]
    this_month_income = total_income_list[this_month - 1]
    sum = 0
    for m in range(this_month):
        sum += total_income_list[m]
    return this_month_income, sum

# call the function
this_month_income, sum = find_income(total_income)
print(this_month_income)  # 100
print(sum)  # 600 when this_month is 6

Conditional operations in dataframe (if else)

I have a data frame column called Install_Date. I want to assign values to another column called Asset_Age under two conditions: if the value in Install_Date is null, then age = current year - plant construct year; if the value is not null, then age = current year - INPUT_Asset["Install_Date"].
This is the code I have. The first condition works fine, but the second condition still gives 0 as values:
Plant_Construct_Year = 1975
this_year = 2020
for i in INPUT_Asset["Install_Date"]:
    if i != 0.0:
        INPUT_Asset["Asset_Age"] = this_year - INPUT_Asset["Install_Date"]
    else:
        INPUT_Asset["Asset_Age"] = this_year - Plant_Construct_Year
INPUT_Asset["Install_Date"] = pd.to_numeric(INPUT_Asset["Install_Date"], errors='coerce').fillna(0)
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] == 0.0, this_year - Plant_Construct_Year, INPUT_Asset["Asset_Age"])
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] != 0.0, this_year - INPUT_Asset["Install_Date"], INPUT_Asset["Asset_Age"])
print(INPUT_Asset["Asset_Age"])
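Note that the loop in the question assigns the whole Asset_Age column on every iteration, which is why only one branch's result survives; the vectorized np.where approach avoids that. A small self-contained illustration with made-up values:

```python
import numpy as np
import pandas as pd

Plant_Construct_Year = 1975
this_year = 2020

# Hypothetical assets: the second one has no recorded install year
INPUT_Asset = pd.DataFrame({'Install_Date': [2010.0, None, 1990.0]})

install = pd.to_numeric(INPUT_Asset['Install_Date'], errors='coerce').fillna(0)
INPUT_Asset['Asset_Age'] = np.where(install == 0.0,
                                    this_year - Plant_Construct_Year,  # fall back to plant age
                                    this_year - install)               # age from install year
print(INPUT_Asset['Asset_Age'].tolist())  # [10.0, 45.0, 30.0]
```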

User input to control output

data['Year'] = input("Select a Year: ")
data['Month'] = input("Select a Month: ")
grouping = data.groupby(["Year", "Month"])
monthly_averages = grouping.aggregate({"Value":np.mean})
print(monthly_averages)
Guys - trying to pick a year, and a month, then show the mean value for that month. The last 3 lines alone will show every year and month average, but I want to be able to select one. New to python, not sure how to apply the choice to the grouping.
Do you have an example table you can show us? Something like this should work but I can't test it without an example. I'd recommend reading up on the loc method.
year = input('Select a year: ')
month = input('Select a month: ')
df2 = data.loc[data['Year'] == year]
df3 = df2.loc[df2['Month'] == month]
grouping = df3.groupby(["Year", "Month"])
monthly_averages = grouping.aggregate({"Value": np.mean})
print(monthly_averages)
I don't think you need grouping if you're picking one month and one year:
df[(df['Year'] == input_year) & (df['Month'] == input_month)].mean()
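With hard-coded choices standing in for input() (and made-up sample data), the filter-then-mean approach looks like:

```python
import pandas as pd

data = pd.DataFrame({'Year': [2020, 2020, 2021],
                     'Month': [1, 1, 1],
                     'Value': [10, 30, 50]})

year, month = 2020, 1  # would come from input() in the real script
mean_value = data[(data['Year'] == year) & (data['Month'] == month)]['Value'].mean()
print(mean_value)  # 20.0
```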

Spliting DataFrame into Multiple Frames by Dates Python

I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()

def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates, which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation on each resulting dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: DataFrame, with no frame. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you.
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
     X       Date
0  328 2018-01-01
1  188 2018-01-02
2  709 2018-01-03
3  259 2018-01-04
4  131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
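On deterministic data the Grouper keys are predictable, which makes the n-day chunking easy to verify; here is a small sketch splitting 10 days of made-up data into 5-day frames:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2018-01-01', periods=10, freq='D'),
                   'X': range(10)})

# Each key is the start of a 5-day bin; each value is that bin's sub-frame
chunks = {k.strftime('%Y-%m-%d'): v
          for k, v in df.groupby(pd.Grouper(key='Date', freq='5D'))}

print(sorted(chunks))             # ['2018-01-01', '2018-01-06']
print(len(chunks['2018-01-01']))  # 5
```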
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of the period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
