Data Snippet
I am trying to add a new column to my data frame that displays the average purchase amount per user. The data frame is called trainDf, and the line of code below produces the average by user. I'm trying to learn how to add it as a column so it displays like the image above.
AveragePurchaseAmountUser = trainDf.groupby(by='User_ID')['Purchase_Amount'].mean()
Thank you in advance!
You can try:
trainDf['AveragePurchaseAmountUser'] = trainDf.groupby(['User_ID'])['Purchase_Amount'].transform('mean')
I would use merge
avg_df = trainDf.groupby(by='User_ID')['Purchase_Amount'].mean().reset_index().rename(columns={'Purchase_Amount': 'Avg'})
trainDf = trainDf.merge(avg_df, on='User_ID')
This will return the DataFrame with the new column:
def avg(df):
    df['Average_Purchase_Amount'] = df['Purchase_Amount'].mean()
    return df

newDf = trainDf.groupby(by='User_ID').apply(avg)
And if you want the column as a Series you can apply this function:
def avgSeries(df):
    return pd.Series(data=df['Purchase_Amount'].mean(), index=df.index)

Then add the column to your DataFrame later.
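A minimal sketch of that last step (group_keys=False is an assumption, used so the result keeps trainDf's original row index):

# Apply avgSeries per user; with group_keys=False the result is indexed like
# trainDf itself, so it can be assigned straight back as a column.
avg_series = trainDf.groupby(by='User_ID', group_keys=False).apply(avgSeries)
trainDf['AveragePurchaseAmountUser'] = avg_series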
This is what transform is for
AveragePurchaseAmountUser = trainDf.groupby(by='User_ID')['Purchase_Amount'].transform('mean')
transform('mean') returns a Series aligned with trainDf's original index, so it can be assigned directly as a new column.
I'm currently struggling with a problem where I'm trying not to use for loops (even though that would make it easier for me to understand) and instead use the 'pandas' approach.
The problem I'm facing is that I have a big dataframe of logs, allLogs, like:
index  message   date_time            user_id
0      message1  2023-01-01 09:00:49  123
1      message2  2023-01-01 09:00:58  123
2      message3  2023-01-01 09:01:03  125
...    etc
I'm doing analysis per user_id, for which I've written a function. This function needs a subset of the allLogs dataframe: all ids, messages and date_times per user_id. Think of it like: for each unique user_id I want to run the function.
This function calculates the time between consecutive messages and makes a Series with all those time-deltas (time differences). I want to turn this into a separate dataframe, so that I have a big list/series/array of time-deltas for each unique user_id.
The current function looks like this:
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
And I use allLogs.groupby('user_id').apply(lambda x: makeSeriesPerUser(x)) to apply it to my user_id groups.
How do I, instead of returning something and adding it to the existing dataframe, build a new dataframe that holds, for each unique user_id, a series of these time-deltas (each user has a different number of logs)?
You should just create a dict where the keys are the user IDs and the values are the relevant DataFrames per user. There is no need to keep everything in one giant DataFrame, unless you have millions of users with only a few records apiece.
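A minimal sketch of that idea (variable names are assumptions):

# groupby iteration yields (group_name, group_frame) pairs, so one dict
# comprehension gives a DataFrame per user, keyed by user_id.
per_user = {user_id: user_df for user_id, user_df in allLogs.groupby('user_id')}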
First off, you should use chaining. It's much simpler to read.
Secondly, pd.DataFrame.groupby().apply can take the function itself; no lambda is required.
Your sort_values(..., inplace=True) returns None. Dropping inplace=True makes it return the sorted DataFrame.
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
Turns into
def extract_timedelta(df_grouped_by_user: pd.DataFrame) -> pd.Series:
    selected_columns = ['message', 'date_time']
    time_delta = (df_grouped_by_user[selected_columns]
                  .drop_duplicates(selected_columns)  # drop duplicate entries
                  ['date_time']                       # select the date_time column
                  .sort_values()                      # sort the date_time values
                  .diff()                             # difference between consecutive timestamps
                  .astype('timedelta64[s]')           # express the differences in seconds
                  .reset_index(drop=True)
                  )
    return time_delta
time_delta_df = df.groupby('user_id').apply(extract_timedelta)
This returns the timedeltas grouped by user_id. The grouped result is actually just a Series with a MultiIndex, where each entry of the index is a ('user_id', int) tuple.
If you want a new dataframe with users as columns, then you can do this:
data = {group_name: extract_timedelta(group_df) for group_name, group_df in messages_df.groupby('user_id')}
time_delta_df = pd.DataFrame(data)
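Note that pd.DataFrame aligns the per-user Series on their (reset) integer index, so users with fewer messages simply end up with NaN in the extra rows of their column.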
I have the following dataframe:
Dataframe
Now I want to find the average of every column and create a new dataframe with the result.
My only solution has been:
#convert all rows to mean of values in column
df_find_mean['Germany'] = (df_find_mean["Germany"].mean())
df_find_mean['Turkey'] = (df_find_mean["Turkey"].mean())
df_find_mean['USA_NJ'] = (df_find_mean["USA_NJ"].mean())
df_find_mean['USA_TX'] = (df_find_mean["USA_TX"].mean())
df_find_mean['France'] = (df_find_mean["France"].mean())
df_find_mean['Sweden'] = (df_find_mean["Sweden"].mean())
df_find_mean['Italy'] = (df_find_mean["Italy"].mean())
df_find_mean['SouthAfrica'] = (df_find_mean["SouthAfrica"].mean())
df_find_mean['Taiwan'] = (df_find_mean["Taiwan"].mean())
df_find_mean['Hungary'] = (df_find_mean["Hungary"].mean())
df_find_mean['Portugal'] = (df_find_mean["Portugal"].mean())
df_find_mean['Croatia'] = (df_find_mean["Croatia"].mean())
df_find_mean['Albania'] = (df_find_mean["Albania"].mean())
df_find_mean['England'] = (df_find_mean["England"].mean())
df_find_mean['Switzerland'] = (df_find_mean["Switzerland"].mean())
df_find_mean['Denmark'] = (df_find_mean["Denmark"].mean())
#Remove all rows except first
df_find_mean = df_find_mean.loc[[0]]
#Verify data
display(df_find_mean)
Which works, but is not very elegant.
Is there some way to iterate over each column and construct a new dataframe as the average (.mean()) of each column?
Expected output:
Dataframe with average of columns from previous dataframes
Use DataFrame.mean, convert the resulting Series to a one-row DataFrame with Series.to_frame, and transpose:
df = df_find_mean.mean().to_frame().T
display(df)
Just use DataFrame.mean() to compute the mean of all your columns, then wrap the result in a list and pass it to pd.DataFrame to get a one-row DataFrame:
means = df_find_mean.mean()
df_mean = pd.DataFrame([means])
display(df_mean)
I want to compute the week of the month for a specified date. To do this, I currently use the user-defined function below.
Input data frame:
Output data frame:
Here is what I have tried:
from math import ceil
def week_of_month(dt):
    """
    Returns the week of the month for the specified date.
    """
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))
After this,
import datetime
import pandas as pd

df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day

wom = pd.Series()

# worker function for creating week of month series
def convert_date(t):
    global wom
    wom = wom.append(pd.Series(week_of_month(datetime.datetime(t[0], t[1], t[2]))), ignore_index=True)

# calling worker function for each row of dataframe
_ = df[['year_of_date','month_of_date','day_of_date']].apply(convert_date, axis=1)

# adding new computed column to dataframe
df['week_of_month'] = wom
# here this updated dataframe should look like Output data frame.
What this does: for each row of the data frame, it computes the week of the month using the given function. It gets slower and slower as the data frame grows, and I currently have more than 10M rows.
I am looking for a faster way of doing this. What changes can I make to this code to vectorize this operation across all rows?
Thanks in advance.
Edit: after reading the answers, the code below is what worked for me:
import numpy as np

first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)
The week_of_month method can be vectorized. It could also be beneficial to skip the conversion to datetime objects and use pandas-only methods:
first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
Right off the bat, without even going into your code or mentioning X/Y problems, etc.:
try to get a list of unique dates; among 10M rows I'm sure more than a few are duplicates.
Steps:
1. create a second df that contains only the columns you need and no duplicates (drop_duplicates)
2. run your function on the small dataframe
3. merge the large and small dfs
4. (optional) drop the small one
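A sketch of those steps (variable names are assumptions):

# 1. small frame with only the unique dates
unique_dates = df[['date']].drop_duplicates()

# 2. run the (slow) per-date function once per unique date
unique_dates['week_of_month'] = unique_dates['date'].apply(week_of_month)

# 3. merge the result back onto the full frame
df = df.merge(unique_dates, on='date', how='left')

# 4. (optional) drop the helper frame
del unique_dates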
import numpy as np
import pandas as pd

data = pd.read_csv("file.csv")
As = data.groupby('A')

for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.NAN

The problem: 'data' stays the same after this loop, even though I'm trying to set values to np.NAN.
As @ohduran suggested:
data = pd.read_csv("file.csv")
As = data.groupby('A')

edited_groups = []
for name, group in As:
    # edit grouped data
    # e.g. group.loc[:, 'column'] = np.nan
    edited_groups.append(group)

new_data = pd.concat(edited_groups)
.groupby() does not change the initial DataFrame. You might want to store what you do with groupby() in a different variable, and then accumulate it into a different DataFrame using that for loop.
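If the goal is to change data itself, a minimal sketch (not from the answer above, and using the hypothetical column name 'column') is to index back into data with each group's row labels:

# group.index holds the row labels of the original frame, so writing through
# data.loc actually modifies data, unlike writing into the group copy.
for name, group in data.groupby('A'):
    data.loc[group.index[0], 'column'] = np.nan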
I'm trying to use Python to read my csv file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as output. Below is the code that I'm working with:
My document consists of:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Here, specify the column numbers you want to select; DataFrame columns are indexed starting at 0:
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']
Hope that helps.
This worked for me, using positional slicing with iloc:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are column positions in the range you want (the end position is excluded), e.g. for the columns at positions 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous range of columns.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will spit out the top 3 rows of the 2nd and 3rd columns (remember that numbering starts at 0).