Format number based on conditional - python

I am new to Python and struggling with a simple formatting issue. I have a table with two columns - metrics and value. I am looking to format the value based on the name of the metric (in the metrics column). Can't seem to get it to work. I'd like the numbers to show as #,### and metrics with the name 'Pct ...' to be #.#%. The code runs OK but no changes are made. Also, some of the values may be nulls. Not sure how to handle that.
# format numbers and percentages
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
for x in pct_options:
    if x in df['metrics']:
        df.value.mul(100).astype('float64').astype(str).add('%')
    else:
        df.value.astype('float64')

IIUC, you can do it with isin. Try:
# first convert your column to float if necessary; note you need to reassign the column
df.value = df.value.astype('float64')
# then change only the rows with the right metrics, using a mask created with isin
mask_pct = df.metrics.isin(pct_options)
df.loc[mask_pct, 'value'] = df.loc[mask_pct, 'value'].mul(100).astype(str).add('%')
EDIT: here may be what you want:
# example df
df = pd.DataFrame({'metrics': ['val', 'Pct Conversion', 'Pct Gross Churn', 'ind', 'Pct Net Churn'],
                   'value': [12345.5432, 0.23245436, 0.4, 13, 0.000004]})
print(df)
           metrics         value
0              val  12345.543200
1   Pct Conversion      0.232454
2  Pct Gross Churn      0.400000
3              ind     13.000000
4    Pct Net Churn      0.000004
# change the formatting with np.where
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
df.value = np.where(df.metrics.isin(pct_options),
                    df.value.mul(100).map('{:.2f}%'.format),
                    df.value.map('{:,.2f}'.format))
print(df)
           metrics      value
0              val  12,345.54
1   Pct Conversion     23.25%
2  Pct Gross Churn     40.00%
3              ind      13.00
4    Pct Net Churn      0.00%
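Since the OP also asked about nulls: mapping a format string over NaN produces the literal string 'nan', so one hedged option is to format only non-null values. A minimal sketch (the empty-string placeholder for missing values is an assumption, adjust as needed):
import numpy as np
import pandas as pd
# format only non-null values; leave a blank placeholder for NaNs (assumption)
fmt_pct = lambda v: '' if pd.isna(v) else '{:.2f}%'.format(v * 100)
fmt_num = lambda v: '' if pd.isna(v) else '{:,.2f}'.format(v)
df.value = np.where(df.metrics.isin(pct_options),
                    df.value.map(fmt_pct),
                    df.value.map(fmt_num))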

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone here can help. I have a stock price dataset with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since the data is monthly). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately, I have no code, since I have looked at for loops, groupby, etc., but can't figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year   2003  2004
month
1         0     2
2         0     3
3         0     4
12        1     0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
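If you want the day and month together in the index (as the question asks) rather than just the month number, a small variation of the same idea should work. A sketch, assuming the index parses as dates:
# build a 'MM-DD' label from the parsed dates and pivot on it
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, monthday=s.strftime('%m-%d'))
       .pivot_table(index='monthday', columns='year', values='Close', fill_value=0)
)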
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test DataFrame
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split the date string into a list of the form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate the date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
        Close
Year     2003   2004
Date
01-01     NaN  7.053
02-01     NaN  6.625
12-01   6.661  8.999

How to replace slow 'apply' method in pandas DataFrame

I have a DataFrame with currencies transactions:
import pandas as pd
data = [[1653663281618, -583.8686, 'USD'],
        [1653741652125, -84.0381, 'USD'],
        [1653776860252, -33.8723, 'CHF'],
        [1653845294504, -465.4614, 'CHF'],
        [1653847155140, 22.285, 'USD'],
        [1653993629537, -358.04640000000006, 'USD']]
df = pd.DataFrame(data=data, columns=['time', 'qty', 'currency_1'])
I need to add a new column, "balance", which would calculate the sum of the 'qty' column for all previous transactions. I have a simple function:
def balance(row):
    table = df[(df['time'] < row['time']) & (df['currency_1'] == row['currency_1'])]
    return table['qty'].sum()

df['balance'] = df.apply(balance, axis=1)
But my real DataFrame is very large, and the .apply method works extremely slowly.
Is there a way to avoid using the apply function in this case?
Something like np.where?
You could just use pandas cumsum here:
EDIT
After adding a condition:
I don't know how transform performs compared to apply; I'd say just try it on your real data. I can't think of an easier solution for the moment.
df['balance'] = df.groupby('currency_1')['qty'].transform(lambda x: x.shift().cumsum())
print(df)
            time       qty currency_1    balance
0  1653663281618 -583.8686        USD        NaN
1  1653741652125  -84.0381        USD  -583.8686
2  1653776860252  -33.8723        CHF        NaN
3  1653845294504 -465.4614        CHF   -33.8723
4  1653847155140   22.2850        USD  -667.9067
5  1653993629537 -358.0464        USD  -645.6217
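Note that shift/cumsum assumes the rows are already ordered by time within each currency, which is what the slow apply version effectively computes. If the frame may be unsorted, sort first; using shift(fill_value=0) also reproduces the 0.0 starting balance of the apply version instead of NaN. A minimal sketch:
# sort by time so the running balance accumulates in transaction order
df = df.sort_values('time')
df['balance'] = (df.groupby('currency_1')['qty']
                   .transform(lambda x: x.shift(fill_value=0).cumsum()))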
old answer:
df['Balance'] = df['qty'].shift(fill_value=0).cumsum()
print(df)
            time       qty currency_1    Balance
0  1653663281618 -583.8686        USD     0.0000
1  1653741652125  -84.0381        USD  -583.8686
2  1653776860252  -33.8723        CHF  -667.9067
3  1653845294504 -465.4614        CHF  -701.7790
4  1653847155140   22.2850        USD -1167.2404
5  1653993629537 -358.0464        USD -1144.9554

Calculations on a pandas DataFrame column conditional on another column

I notice several 'set value of new column based on value of another'-type questions, but from what I gather, none that I have found address dividing values in the same column based on conditions set by another column.
The data I have is as in the table below, minus the column (variable) 'healthpertotal'.
It shows, in the column 'function', the amount of government spending (aka expenditure) on
a) health (column 'value'), and
b) its total spending (same column 'value'),
along with the associated year of that spending (column 'year').
I want to make a new column that shows the percent of government health spending over its total spending, for a given year, as shown below in the column 'healthpertotal'.
So for instance, in 1995, the value of this variable is (42587 [health spending amount] / 326420 [total spending amount]) * 100 = 13.05.
As for the rows showing total spending, the 'healthpertotal' could be 'missing', 1, or 'not applicable' and the like. I am ok with any of these options.
How would I set up this new column 'healthpertotal' using python?
A proposed table or DataFrame for what I would like to achieve follows (and its code on how it might be set up - artificially 'forced' in the case of the final variable 'healthpertotal') :
data = {'function': ['Health'] * 3 + ['Total'] * 3,
        'year': [1995, 1996, 1997, 1995, 1996, 1997],
        'value': [42587, 44209, 44472, 326420, 333637, 340252],
        'healthpertotal': [13.05, 13.25, 13.07] + [np.nan] * 3
        }
df = pd.DataFrame(data)
print(df)
Expected outcome:
  function  year   value  healthpertotal
0   Health  1995   42587           13.05
1   Health  1996   44209           13.25
2   Health  1997   44472           13.07
3    Total  1995  326420             NaN
4    Total  1996  333637             NaN
5    Total  1997  340252             NaN
You could use groupby + transform('last') to broadcast the total values across the DataFrame (this assumes the 'Total' row is the last entry within each year); then divide "value" by it using rdiv; then replace 100 with NaN (assuming health spending is never 100%):
df['healthpertotal'] = (df.groupby('year')['value'].transform('last')
                          .rdiv(df['value']).mul(100).replace(100, np.nan))
We could also use merge + concat (calculate the percentage in between these operations):
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year')
tmp['healthpertotal'] = tmp['value_x'] / tmp['value_y'] * 100
msk = tmp.columns.str.contains('_y')
tmp1 = tmp.loc[:, ~msk]
tmp2 = tmp[tmp.columns[msk].tolist() + ['year']]
pd.concat((tmp1.set_axis(tmp1.columns.map(lambda x: x.split('_')[0]), axis=1),
           tmp2.set_axis(tmp2.columns.map(lambda x: x.split('_')[0]), axis=1)))
We could also use merge + wide_to_long (calculate the percentage in between these operations) + mask the duplicates:
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year', suffixes=('0','1'))
tmp['healthpertotal'] = tmp['value0'] / tmp['value1'] * 100
out = pd.wide_to_long(tmp, stubnames=['function', 'value'], i=['year','healthpertotal'], j='').droplevel(-1).reset_index()
out['healthpertotal'] = out['healthpertotal'].mask(out['healthpertotal'].duplicated())
Output:
  function  year   value  healthpertotal
0   Health  1995   42587       13.046688
1   Health  1996   44209       13.250629
2   Health  1997   44472       13.070313
3    Total  1995  326420             NaN
4    Total  1996  333637             NaN
5    Total  1997  340252             NaN
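A simpler alternative is to map each year to its total and divide, keeping the result only on the 'Health' rows. A sketch, assuming exactly one 'Total' row per year:
# look up each row's yearly total, divide, and blank out the 'Total' rows
totals = df.loc[df['function'] == 'Total'].set_index('year')['value']
df['healthpertotal'] = (df['value'].div(df['year'].map(totals))
                                   .mul(100)
                                   .where(df['function'] == 'Health'))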

Filter values as per std deviation for individual column

I am working on a requirement where I need to set particular values to NaN based on the variable upper, which is the upper bound of my standard-deviation range.
Here is some sample code:
data = {'year': ['2014', '2014', '2015', '2014', '2015', '2015', '2015', '2014', '2015'],
        'month': ['Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Toyota'],
        'make': [23, 34, 32, 22, 12, 33, 44, 11, 21]
        }
df = pd.DataFrame.from_dict(data)
df = pd.pivot_table(df, index='month', columns='year', values='make', aggfunc=np.sum)
upper = df.mean() + 3*df.std()
This is just sample data; the real data is huge. Based on upper's value for each year, I need to filter that year's column accordingly.
(Sample input df, the upper std-dev values, and the desired output were shown as images.)
Based on the upper std-deviation value for each individual year, a value should be converted to NaN if value < upper.
E.g. 2014 has upper = 138, so in 2014's column only, if a value < upper, convert it to NaN.
2014's upper value is only applicable within 2014 itself, and the same goes for 2015.
IIUC, use DataFrame.lt to compare the DataFrame against the Series, and then set NaNs on the matches with DataFrame.mask:
print (df.lt(upper))
year      2014  2015
month
Hyundai   True  True
Toyota    True  True
df = df.mask(df.lt(upper))
print (df)
year     2014  2015
month
Hyundai   NaN   NaN
Toyota    NaN   NaN
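Equivalently, DataFrame.where keeps the values where the condition holds and fills NaN elsewhere, so the inverse comparison gives the same result. A one-line sketch:
# keep only values at or above the per-year threshold; the rest become NaN
df = df.where(df.ge(upper))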

Show value as % of column total in pandas pivot

I am doing a pivot of values in pandas as follows:
ddp = pd.pivot_table(df, values='Loan.ID', index=['DPD2'], columns='PaymentPeriod', aggfunc='count').reset_index()
But instead of getting count of Loan.ID I want the count of Loan.ID divided by the column total for each column.
For example, instead of getting values like below (I don't have the grand total row shown in the image),
I want the percentages as below.
How can I do this in pandas?
If values are not numeric, first cast to floats, or convert non-parseable values to NaNs:
ddp = ddp.astype(float)
# alternative
# ddp = ddp.apply(pd.to_numeric, errors='coerce')
Then use sum to add a Grand Total last row:
ddp = pd.DataFrame({'2017-06': [186, 104, 2], '2017-07': [294, 98, 10]})
ddp.loc['Grand Total'] = ddp.sum()
print(ddp)
             2017-06  2017-07
0                186      294
1                104       98
2                  2       10
Grand Total      292      402
Then divide all data by the last row with DataFrame.div, multiply by 100, and append a percent sign:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).astype(str) + '%'
print(df)
            2017-06 2017-07
0             63.7%  73.13%
1            35.62%  24.38%
2             0.68%   2.49%
Grand Total  100.0%  100.0%
Or, if you need the floats with two decimal places:
df = ddp.div(ddp.iloc[-1]).mul(100).round(2).applymap("{:10.02f}%".format)
print(df)
                2017-06     2017-07
0                63.70%      73.13%
1                35.62%      24.38%
2                 0.68%       2.49%
Grand Total     100.00%     100.00%
You can also try the code below for a column-specific format change with style.format:
df = df.style.format({'Column1': '{:,.0%}'.format, 'Column2': '{:,.1%}'.format})
Use your specific column names instead of the 'Column1'/'Column2' labels in the code above.
Let me know if this code works for you.
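If you are building the table from the raw data anyway, pd.crosstab can normalize within each column directly, which may save the Grand-Total bookkeeping. A sketch, assuming df holds the OP's raw 'DPD2' and 'PaymentPeriod' columns:
# counts normalized per column, expressed as percentages
ddp = (pd.crosstab(df['DPD2'], df['PaymentPeriod'], normalize='columns')
         .mul(100)
         .round(2))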
