I have a dataset that includes the information below. I'd like to build a pivot table that counts the number of days from the Date column and then sums the Impressions, Clicks, Conversions, and Budget Delivered columns. Essentially, I'd like a one-row summary of the table.
Date Impressions Clicks Conversions Budget Delivered
0 1/1/2019 11,506,995 1,672 88 $12,124.14
1 1/2/2019 9,394,458 1,516 179 $9,838.45
2 1/3/2019 4,696,388 878 129 $6,858.67
3 1/4/2019 8,987,784 1,179 107 $9,566.55
4 1/5/2019 8,923,751 1,171 88 $9,322
I am having trouble figuring out how to return this single-row DataFrame. I am trying to use the pivot_table method, but the groupby parameter is not returning the desired result. I'm not sure how to approach this.
from datatable import dt, f, by
df = dt.Frame("""
Date Impressions Clicks Conversions Budget Delivered
1/1/2019 11,506,995 1,672 88 $12,124.14
1/2/2019 9,394,458 1,516 179 $9,838.45
1/3/2019 4,696,388 878 129 $6,858.67
1/4/2019 8,987,784 1,179 107 $9,566.55
1/5/2019 8,923,751 1,171 88 $9,322
""")
# the "Budget Delivered" header is read as two columns ("Budget" and "Delivered"),
# so strip the '$' and ',' out of "Budget" and convert it to floats
budget = df['Budget'].to_list()[0]
budget = [float(x.replace('$', '').replace(',', '')) for x in budget]
df['Budget'] = dt.Frame(budget)

# sum every column except Date (columns 1 through 5)
df[:, dt.sum(f[1:6])]
| Impressions Clicks Conversions Budget Delivered
-- + ----------- ------ ----------- ------- ---------
0 | 43509376 6416 591 47709.8 0
The main problem is string cleaning. As it stands, your input DataFrame contains mostly strings because of non-numeric characters such as '/', ',' and '$'. The first step is to clean the data and convert it to a summable type such as int or float. Then we can sum all rows.
For non-numeric fields that should be counted rather than summed ('Date'), we replace the concatenated strings that sum produces with counts.
Also, I'm not sure you need a single-row DataFrame when a Series would have sufficed, but since it was in the requirements I did that too.
It's inelegant, but it works:
# data_df is the original DataFrame from the question
def clean(x):
    if isinstance(x, str):
        return x.replace('$', '').replace(',', '')
    return x

data_df['Budget Delivered'] = data_df['Budget Delivered'].apply(clean).astype('float')

col_names_to_intify = ['Impressions', 'Clicks', 'Conversions']
for col in col_names_to_intify:
    data_df[col] = data_df[col].apply(clean).astype('int')

# sum all columns into a single row, then replace the "sums" of the
# remaining object (string) columns with their counts
sum_df = data_df.sum().to_frame().T
for col in data_df.columns:
    if data_df[col].dtypes.str == '|O':
        sum_df[col] = data_df[col].count()
which gives sum_df as
   Date  Impressions  Clicks  Conversions  Budget Delivered
0     5     43509376    6416          591          47709.81
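For reference, once the columns are numeric, the same one-row summary can also be built in a single agg call; a minimal sketch, assuming the cleaned data_df from above and the column names from the question:
# count the Date column, sum the numeric columns, then transpose into one row
summary = data_df.agg({'Date': 'count',
                       'Impressions': 'sum',
                       'Clicks': 'sum',
                       'Conversions': 'sum',
                       'Budget Delivered': 'sum'}).to_frame().T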
I have a dataframe as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the following:
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1)  # I also tried this, but how do I do it for only some columns (and not all columns)?
But if I wish to compute the average for the past 12 months, it isn't elegant to write it this way. Is there a better or more efficient way, where I can just key in the number of columns to look back and have the average computed from that input?
I expect my output to be as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If the column order is not guaranteed, sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
You could try something like this, in reference to this post:
n_months = 4  # you could also do this in a loop, e.g. for n_months in range(1, 12)
# take the last n_months columns (this assumes they are the revenue columns, oldest to newest)
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)
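If you need several look-back windows at once, the same slicing idea can be driven by a loop; a small sketch, assuming the revenue-only df2 built with filter in the answer above (so customer_id and any already-added mean columns are never picked up):
# compute the mean over the last n revenue columns for each window size
for n in (2, 4):
    df[f'revenue_mean_{n}m'] = df2.iloc[:, -n:].mean(axis=1)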
I want to make a separate dataframe for each Number (column B) where Main Date > Reported Date (see the image below). If this condition is true, then I have to make another dataframe displaying that Number's data.
Example: if we take Number (column B) 223311, and any Main Date > Reported Date for that Number, then display all the records of that Number.
Here is a simple solution with pandas. You can separate out DataFrames very easily by the values of a particular column. From there, iterate over the new DataFrame, resetting its index first (if you want to keep the original index, iterate over df.index instead of a range). I appended the DataFrames to a list for convenience; they could easily be extracted into labeled DataFrames or combined. The long variable names are there to help comprehension.
import pandas as pd

df = pd.read_csv('forstack.csv')
# note: for a reliable comparison, 'Main Date' and 'Reported Date' should be
# parsed with pd.to_datetime; here they are compared as read
list_of_dataframes = []  # a place to store each dataframe; you could also name them as you go
checked_Numbers = []     # simply to avoid adding the same dataframe twice
for aNumber in df['Number']:             # for every number in the column "Number"
    if aNumber not in checked_Numbers:   # while this number has not been processed
        checked_Numbers.append(aNumber)  # mark it as checked
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True)  # "make a different dataframe" per request, with a new index
        for index in range(len(df_forThisNumber)):  # check each row of this dataframe against the criteria
            if df_forThisNumber.at[index, 'Main Date'] > df_forThisNumber.at[index, 'Reported Date']:
                list_of_dataframes.append(df_forThisNumber)  # if a row matches, keep this dataframe
                break  # append each matching dataframe only once
Outputs:
Main Date Number Reported Date Fee Amount Cost Name
0 1/1/2019 223311 1/1/2019 100 12 20 11
1 1/7/2019 223311 1/1/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/2/2019 111111 1/2/2019 100 12 20 11
1 1/6/2019 111111 1/2/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/3/2019 222222 1/3/2019 100 12 20 11
1 1/8/2019 222222 1/3/2019 100 12 20 11
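A hedged alternative to the row-by-row check: parse the dates, find the Numbers that have at least one row with Main Date later than Reported Date, and split the frame once with groupby. The column names are taken from the question; the dictionary name is just for illustration.
import pandas as pd

df = pd.read_csv('forstack.csv')
df['Main Date'] = pd.to_datetime(df['Main Date'])
df['Reported Date'] = pd.to_datetime(df['Reported Date'])

# Numbers that have at least one row where Main Date > Reported Date
wanted = df.loc[df['Main Date'] > df['Reported Date'], 'Number'].unique()

# one DataFrame per matching Number, keyed by the Number itself
frames_by_number = {number: group for number, group in df.groupby('Number')
                    if number in wanted}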
I am doing a pivot of values in pandas as follows:
ddp = pd.pivot_table(df, values='Loan.ID', index=['DPD2'], columns='PaymentPeriod', aggfunc='count').reset_index()
But instead of getting a count of Loan.ID, I want the count of Loan.ID divided by the column total for each column.
For example, instead of getting values like those below (I don't have the grand total row shown in the image),
I want the percentages as below.
How do I do this in pandas?
If the values are not numeric, first cast them to floats or convert non-parseable values to NaNs:
ddp = ddp.astype(float)
#alternative
#ddp = ddp.apply(pd.to_numeric, errors='coerce')
Then use sum to add a Grand Total last row:
ddp = pd.DataFrame({'2017-06': [186, 104, 2], '2017-07': [294,98,10]})
ddp.loc['Grand Total'] = ddp.sum()
print (ddp)
2017-06 2017-07
0 186 294
1 104 98
2 2 10
Grand Total 292 402
Then divide all the data by the last row with DataFrame.div, multiply by 100, and append a percent sign:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).astype(str) + '%'
print(df)
2017-06 2017-07
0 63.7% 73.13%
1 35.62% 24.38%
2 0.68% 2.49%
Grand Total 100.0% 100.0%
Or if you need the values formatted with two decimal places:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).applymap("{:10.02f}%".format)
print(df)
2017-06 2017-07
0 63.70% 73.13%
1 35.62% 24.38%
2 0.68% 2.49%
Grand Total 100.00% 100.00%
You can also change the format of specific columns with style.format:
df = df.style.format({'Column1': '{:,.0%}'.format, 'Column2': '{:,.1%}'.format})
You need to put your specific column names in place of the 'Column1'/'Column2' labels in the code above.
Let me know if this code works for you.
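For reference, the column-wise shares can also be produced directly with pd.crosstab and normalize='columns', skipping the manual Grand Total row; a sketch that assumes each row of the raw df carries exactly one Loan.ID, so the counts match the pivot above:
# fraction of rows per DPD2 value within each PaymentPeriod column
ddp_pct = pd.crosstab(df['DPD2'], df['PaymentPeriod'], normalize='columns')
print((ddp_pct * 100).round(2).astype(str) + '%')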
I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DataFrame has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
For each row, get these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this row's start date, and...
those that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, and then append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gap counts over the
last 3-year period. It uses gaps.sql as its source extract. This query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back over the previous 3 years (those
previous rows that have the same GROUP_ID, whose effective dates
come after this row's THREE_YEAR_AGO, and whose end dates come before
this row's beginning date). Those rows' GAP values are added up and a new
column called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row ID_NBR 11 has a value of 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date in 2014.
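For what it's worth, one way to cut the cost without changing the logic is to restrict the search to each GROUP_ID first, so every row only scans its own group instead of the whole 200k-row frame. This is a hedged sketch of that idea (a per-group version of the same comparison, not a full vectorization); three_year_gap is just an illustrative name:
import pandas as pd

def three_year_gap(group):
    # for each row, sum GAP over rows in the same group whose BEG_DATE falls
    # inside the row's three-year window and whose END_DATE is before the row begins
    beg = group['BEG_DATE'].values
    end = group['END_DATE'].values
    three = group['THREE_YEAR_AGO'].values
    gap = group['GAP'].values
    totals = []
    for i in range(len(group)):
        mask = (beg >= three[i]) & (end <= beg[i])
        totals.append(gap[mask].sum() + gap[i])
    return pd.Series(totals, index=group.index)

df['GAP_THREE'] = df.groupby('GROUP_ID', group_keys=False).apply(three_year_gap)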
I have a data series, df:
primary
Buy 484
Sell 429
Blanks 130
FX Spot 108
Income 77
FX Forward 2
I'm trying to create a dataframe with 2 columns.
The first column's values should be the index of df.
The second column should have the values of primary in df.
By using
filter_df=pd.DataFrame({'contents':df.index, 'values':df.values})
I get,
Exception: Data must be 1-dimensional
Use reset_index with rename_axis for the new column name:
filter_df = df.rename_axis('content').reset_index()
Another solution with rename:
filter_df = df.reset_index().rename(columns={'index':'content'})
For a DataFrame built with the constructor, you need df['primary'] to select the column:
filter_df=pd.DataFrame({'contents':df.index, 'values':df['primary'].values})
print (filter_df)
     contents  values
0         Buy     484
1        Sell     429
2      Blanks     130
3     FX Spot     108
4      Income      77
5  FX Forward       2