I am doing a pivot of values in pandas as follows:
ddp = pd.pivot_table(df, values='Loan.ID', index=['DPD2'], columns='PaymentPeriod', aggfunc='count').reset_index()
But instead of getting count of Loan.ID I want the count of Loan.ID divided by the column total for each column.
For example, instead of raw counts like the ones shown below, I want each value as a percentage of its column total. (My table does not have a grand total row.)
How can I do this in pandas?
If the values are not numeric, first cast them to floats, or coerce anything non-parseable to NaN:
ddp = ddp.astype(float)
# alternative: coerce non-parseable values to NaN
# ddp = ddp.apply(pd.to_numeric, errors='coerce')
Then use sum to add a Grand Total row at the bottom:
ddp = pd.DataFrame({'2017-06': [186, 104, 2], '2017-07': [294,98,10]})
ddp.loc['Grand Total'] = ddp.sum()
print (ddp)
2017-06 2017-07
0 186 294
1 104 98
2 2 10
Grand Total 292 402
Then divide all rows by the last row with DataFrame.div, multiply by 100, round, and append a percent sign:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).astype(str) + '%'
print(df)
2017-06 2017-07
0 63.7% 73.13%
1 35.62% 24.38%
2 0.68% 2.49%
Grand Total 100.0% 100.0%
Or, if you need a fixed two-decimal format:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).applymap("{:10.02f}%".format)
print(df)
2017-06 2017-07
0 63.70% 73.13%
1 35.62% 24.38%
2 0.68% 2.49%
Grand Total 100.00% 100.00%
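As an aside, if the goal is just each column's share of the counts, pd.crosstab can produce it in one step. A sketch, assuming the original df has the 'DPD2' and 'PaymentPeriod' columns from the question (note it counts rows rather than non-null Loan.ID values):
pct = pd.crosstab(df['DPD2'], df['PaymentPeriod'], normalize='columns').mul(100).round(2)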
You can also change the format per column with style.format:
df = df.style.format({'Column1': '{:,.0%}'.format, 'Column2': '{:,.1%}'.format})
Replace the 'Column1'/'Column2' placeholders with your actual column names.
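For instance, applied to the example above (a sketch; note the '%' formatter expects fractions such as 0.637, so divide without multiplying by 100 first):
ratios = ddp.div(ddp.iloc[-1])
ratios.style.format({'2017-06': '{:,.1%}', '2017-07': '{:,.1%}'})
# renders 63.7%, 73.1%, ... when displayed in a notebook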
Let me know if this code works for you.
I have a dataframe like the one shown below:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # i also tried this but how to do for only two columns (and not all columns)
But if I wish to compute the average over the past 12 months, it is not elegant to write it this way. Is there a better or more efficient way, where I can just key in the number of columns to look back and it computes the average from that input?
I expect my output to be like below:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If column order is not guaranteed, sort them with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could also try something like this, in reference to the answer above (using the filtered df2 from there):
n_months = 4  # you could also do this in a loop for all months, e.g. range(1, 12)
df[f'revenue_mean_{n_months}m'] = df2.iloc[:, -n_months:].mean(axis=1)
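A sketch of that loop idea (assuming df2 holds only the revenue_m* columns in chronological order; the window sizes are just examples):
for n_months in (2, 4, 12):
    # take up to the last n_months revenue columns (fewer if the history is shorter)
    window = df2.iloc[:, -n_months:]
    df[f'revenue_mean_{n_months}m'] = window.mean(axis=1)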
I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100, which I did.
The second task is grouping id by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/DataFrames, but right now I can't imagine from which side to attack this problem.
For what it's worth, I tried the suggested solution here: count consecutive days python dataframe, but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a "count" column, because afterwards I need to count ids by ranges of those consecutive-day counts, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc., but that is not part of my question.
Thank you a lot
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to make the Pandas series change back to a dataframe and also name the new column as count.
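As for the follow-up the question mentions (counting ids by ranges of consecutive days), a sketch with pd.cut, where the bin edges are assumptions:
import numpy as np
bins = [0, 7, 12, np.inf]        # assumed day ranges: 0-7, 7-12, 12+
labels = ['0-7', '7-12', '12+']
new_frame['range'] = pd.cut(new_frame['count'], bins=bins, labels=labels)
print(new_frame.groupby('range')['id'].count())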
I have a dataset with the information below. I'd like to produce a pivot table that counts the number of days from the Date column and sums the Impressions, Clicks, Conversions, and Budget Delivered columns. Essentially, I'd like a one-row summary of the table.
Date Impressions Clicks Conversions Budget Delivered
0 1/1/2019 11,506,995 1,672 88 $12,124.14
1 1/2/2019 9,394,458 1,516 179 $9,838.45
2 1/3/2019 4,696,388 878 129 $6,858.67
3 1/4/2019 8,987,784 1,179 107 $9,566.55
4 1/5/2019 8,923,751 1,171 88 $9,322
I am having trouble figuring out how to return this single-row DataFrame. I am trying to use the pivot_table method, but I cannot get the grouping parameters to return the desired result. Not sure how to approach this.
from datatable import dt, f, by
df = dt.Frame("""
Date Impressions Clicks Conversions Budget Delivered
1/1/2019 11,506,995 1,672 88 $12,124.14
1/2/2019 9,394,458 1,516 179 $9,838.45
1/3/2019 4,696,388 878 129 $6,858.67
1/4/2019 8,987,784 1,179 107 $9,566.55
1/5/2019 8,923,751 1,171 88 $9,322
""")
budget = df['Budget'].to_list()[0]
budget = [float(x.replace('$', '').replace(',', '')) for x in budget]
df['Budget'] = dt.Frame(budget)
df[:, dt.sum(f[1:6])]
| Impressions Clicks Conversions Budget Delivered
-- + ----------- ------ ----------- ------- ---------
0 | 43509376 6416 591 47709.8 0
The main problem is string cleaning. As it stands, your input DataFrame contains mostly strings because of non-numeric characters such as '/', ',' and '$'. The first step is to clean the data and convert it to a summable type such as int or float. Then we can sum all rows.
For non-numeric fields that should be counted rather than summed ('Date'), we replace those summed strings with counts.
Also, not sure you need a single row DataFrame when a Series would have sufficed, but since it was in the requirements I did that too.
It's inelegant, but it works:
def clean(x):
    # strip currency and thousands separators from strings, leave other values as-is
    if isinstance(x, str):
        return x.replace('$', '').replace(',', '')
    return x

data_df['Budget Delivered'] = data_df['Budget Delivered'].apply(clean).astype('float')

col_names_to_intify = ['Impressions', 'Clicks', 'Conversions']
for col in col_names_to_intify:
    data_df[col] = data_df[col].apply(clean).astype('int')

# one-row frame of column sums
sum_df = data_df.sum().to_frame().T

# object (string) columns such as 'Date' should be counted, not summed
for col in data_df.columns:
    if data_df[col].dtypes.str == '|O':
        sum_df[col] = data_df[col].count()
which gives sum_df as
Date Impressions Clicks Conversions Budget Delivered
0 5 43509376 6416 591 47709.81
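A more compact sketch of the same idea in plain pandas (assuming the column names above and that the raw values are strings):
num_cols = ['Impressions', 'Clicks', 'Conversions', 'Budget Delivered']
# strip '$' and ',' (both literal inside the character class), then parse to numbers
cleaned = data_df[num_cols].replace(r'[$,]', '', regex=True).apply(pd.to_numeric)
summary = cleaned.sum().to_frame().T
summary.insert(0, 'Date', data_df['Date'].count())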
I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of Trial to Full License conversions. I am trying to do this by taking into consideration the daily data and the user_email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, check the License_Type column: if the value is 'full', return 1 in a new 'Conversion' column, else return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month, and I should have an aggregate count of monthly conversions in the 'Conversion' column. It should look something like below:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below was my attempt at getting the desired output above:
#attempt to create a new column Conversion and fill with 1 and 0 for if converted or not.
for values in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0  # because there is no chance to go from trial to Full
    else:
        if df['License_type'] = 'full':  # check if license type is full
            df['Conversion'] = 1  # if full, I assume it was originally trial and now is full
# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for, so I have extended it below. Rather than loop (almost always a pandas anti-pattern), use a simple function that operates row by row.
For the uniqueness test, I first get a count of uses of each email address and set that number of occurrences on each row.
Your logic I've transcribed in a slightly different way.
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
import re

# strip the leading row index from each line, then split on whitespace
a = [[t.strip() for t in re.split(" ", l) if t.strip() != ""]
     for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
emailc=df.groupby("User_Email")["User_Email"].transform("count"),
Conversion=lambda dfa: dfa.apply(lambda r: 0 if r["emailc"]==1 or r["License_Type"]=="trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Output:
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
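For reference, the same row logic can also be written without apply (a sketch using the column names above):
dup = df.groupby('User_Email')['User_Email'].transform('count') > 1
df['Conversion'] = (dup & df['License_Type'].eq('full')).astype(int)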
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'VipNo':np.repeat( range(3), 2 ),
'Quantity': np.random.randint(200,size=6),
'OrderDate': np.random.choice( pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to see if the order date is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want the quantity on that day divided by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range, in which case I would want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start.
Thank you so much!
You can create a new column masking quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start,end)*df.Quantity)
.groupby('VipNo')[['Quantity','QuantitySub']]
.sum()
.assign(output=lambda x: x['QuantitySub']/x['Quantity'])
.drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
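If you want the column named exactly as in the question and attached back onto df, a sketch along the same lines:
result = (df.assign(QuantitySub=df['OrderDate'].between(start, end) * df.Quantity)
            .groupby('VipNo')[['Quantity', 'QuantitySub']].sum())
df['qtywithin1mon/totalqty'] = df['VipNo'].map(result['QuantitySub'] / result['Quantity'])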