Calculations on a pandas DataFrame column conditional on another column - python

I have seen several 'set value of new column based on value of another column' questions, but as far as I can tell, none of them address dividing values within the same column based on conditions set by another column.
The data I have is as the table below, minus the column (variable) 'healthpertotal'.
It shows, in the column 'function', whether the government spending (aka expenditure) amount in column 'value' is
a) spending on health, or
b) total spending,
together with the associated year of that spending (column 'year').
I want to make a new column that shows the percent of government health spending over its total spending, for a given year, as shown below in the column 'healthpertotal'.
So for instance, in 1995, the value of this variable is (42587(health spending amount)/326420(total spending amount))*100=13.05.
As for the rows showing total spending, the 'healthpertotal' could be 'missing', 1, or 'not applicable' and the like. I am ok with any of these options.
How would I set up this new column 'healthpertotal' using python?
A proposed DataFrame for what I would like to achieve follows, along with the code used to set it up (the final column 'healthpertotal' is filled in by hand):
import numpy as np
import pandas as pd

data = {'function': ['Health'] * 3 + ['Total'] * 3,
        'year': [1995, 1996, 1997, 1995, 1996, 1997],
        'value': [42587, 44209, 44472, 326420, 333637, 340252],
        'healthpertotal': [13.05, 13.25, 13.07] + [np.nan] * 3
        }
df = pd.DataFrame(data)
print(df)
Expected outcome:
function year value healthpertotal
0 Health 1995 42587 13.05
1 Health 1996 44209 13.25
2 Health 1997 44472 13.07
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN

You could use groupby + transform('last') to broadcast each year's total value onto every row of that year; then divide "value" by it using rdiv; then replace 100 with NaN (assuming health spending is never exactly 100% of the total):
df['healthpertotal'] = df.groupby('year')['value'].transform('last').rdiv(df['value']).mul(100).replace(100, np.nan)
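If relying on replace(100, ...) feels fragile, a more explicit sketch of the same idea divides by the per-year total and then blanks out the 'Total' rows directly (this assumes, as above, that the 'Total' row is the last row of each year group):
# per-year total spending, broadcast to every row of that year
totals = df.groupby('year')['value'].transform('last')
# health share in percent; leave NaN on the 'Total' rows themselves
df['healthpertotal'] = (df['value'] / totals * 100).where(df['function'] != 'Total')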
We could also use merge + concat (calculating the percentage in between these operations):
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year')
tmp['healthpertotal'] = tmp['value_x'] / tmp['value_y'] * 100
msk = tmp.columns.str.contains('_y')
tmp1 = tmp.loc[:, ~msk]
tmp2 = tmp[tmp.columns[msk].tolist() + ['year']]
pd.concat((tmp1.set_axis(tmp1.columns.map(lambda x: x.split('_')[0]), axis=1),
           tmp2.set_axis(tmp2.columns.map(lambda x: x.split('_')[0]), axis=1)))
We could also use merge + wide_to_long (calculating the percentage in between these operations) and then mask the duplicates:
tmp = df.loc[df['function']=='Health'].merge(df.loc[df['function']=='Total'], on='year', suffixes=('0','1'))
tmp['healthpertotal'] = tmp['value0'] / tmp['value1'] * 100
out = pd.wide_to_long(tmp, stubnames=['function', 'value'], i=['year','healthpertotal'], j='').droplevel(-1).reset_index()
out['healthpertotal'] = out['healthpertotal'].mask(out['healthpertotal'].duplicated())
Output:
function year value healthpertotal
0 Health 1995 42587 13.046688
1 Health 1996 44209 13.250629
2 Health 1997 44472 13.070313
3 Total 1995 326420 NaN
4 Total 1996 333637 NaN
5 Total 1997 340252 NaN

Related

How to join different dataframe with specific criteria?

In my MySQL database stocks, I have 5 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in MySQL first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I also don't know the code.
This is how I want to display: https://www.dropbox.com/s/uv1iik6m0u23gxp/ExpectedoutputFULL.csv?dl=0
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
I only need the most recent 8 quarters of quarterly data and 5 years of yearly data to be displayed.
The exact dates for quarterly data may differ between tickers. If done by hand, the most recent eight quarters can easily be copied and pasted into the respective columns, but I have no idea how to make a computer determine which quarter a value belongs to and show it in the same column as in my example output. (I use the terms q1 through q8 simply as column names to display, so if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.)
If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.
1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data:
2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data:
3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data:
4th table earnings_q: https://www.dropbox.com/s/bufh7c2jq7veie9/earnings_q.csv?dl=0 contains quarterly data:
5th table earnings_y: https://www.dropbox.com/s/li0r5n7mwpq28as/earnings_y.csv?dl=0 contains yearly data:
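For the "extract each table as a dataframe" route, a minimal sketch of reading the tables from MySQL with pandas.read_sql (the connection string is a placeholder; in the answer below, df1, df2 and df3 appear to correspond to company_info, income_statement_q and income_statement_y):
import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials for the local MySQL database "stocks"
engine = create_engine('mysql+pymysql://user:password@localhost/stocks')

df1 = pd.read_sql('SELECT * FROM company_info', engine)        # company info
df2 = pd.read_sql('SELECT * FROM income_statement_q', engine)  # quarterly income statements
df3 = pd.read_sql('SELECT * FROM income_statement_y', engine)  # yearly income statements
df4 = pd.read_sql('SELECT * FROM earnings_q', engine)          # quarterly earnings
df5 = pd.read_sql('SELECT * FROM earnings_y', engine)          # yearly earnings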
You can use:
# Convert to datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date'])  # quarterly
df3['date'] = pd.to_datetime(df3['date'])  # yearly

# Realign dates to the period end: e.g. 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)

# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()

# Create date ranges (8 periods for Q, 5 periods for Y)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')

# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
         .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))

ydf = (df3[df3['date'].isin(ydti)]
         .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
         .pivot(index='ticker', columns='date', values='netIncome'))

# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
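If the q1..q8-style labels from the expected output are preferred over the period labels, the pivoted frames could be relabelled before the concat (a sketch; pivot sorts the period columns chronologically, so position 1 is the oldest, and the y1..y5 naming for the yearly columns is an assumption):
# rename period columns to positional labels, oldest -> newest
qdf.columns = [f'q{i}' for i in range(1, len(qdf.columns) + 1)]
ydf.columns = [f'y{i}' for i in range(1, len(ydf.columns) + 1)]
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()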

Filter values as per std deviation for individual column

I am working on a requirement where I need to set particular values to NaN based on the variable upper, which is my upper standard-deviation bound.
Here is some sample code:
import numpy as np
import pandas as pd

data = {'year': ['2014', '2014', '2015', '2014', '2015', '2015', '2015', '2014', '2015'],
        'month': ['Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Hyundai', 'Toyota', 'Toyota'],
        'make': [23, 34, 32, 22, 12, 33, 44, 11, 21]
        }
df = pd.DataFrame.from_dict(data)
df = pd.pivot_table(df, index='month', columns='year', values='make', aggfunc=np.sum)
upper = df.mean() + 3 * df.std()
This is just sample data; the real data is huge. Based on upper's value for every year, I need to filter each year's column accordingly.
Sample input (df), the per-year upper standard-deviation values, and the desired output were provided as screenshots.
Based on the upper standard-deviation value for each individual year, a value should be converted to NaN if value < upper.
E.g. 2014 has upper = 138, so in 2014's column only, values < upper are converted to NaN.
2014's upper value applies only to 2014 itself, and the same goes for 2015.
IIUC, use DataFrame.lt to compare the DataFrame with the Series, and then set NaN where it matches using DataFrame.mask:
print (df.lt(upper))
year 2014 2015
month
Hyundai True True
Toyota True True
df = df.mask(df.lt(upper))
print (df)
year 2014 2015
month
Hyundai NaN NaN
Toyota NaN NaN
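The same result can be written as a single where (a small sketch); because upper is a Series indexed by year, the comparison aligns with the year columns of df:
# keep only values that reach the per-year upper bound, NaN elsewhere
df = df.where(df.ge(upper))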

Finding the highest value in a column for a given range in another column

I'm quite new to pandas and dataframes. I want to find the product ('product') from a dataframe that gave the highest income ('income') in the years ('year') 1990 to 1999.
My best attempt only gives me the row number from the dataframe and the income, although I want it to show all other columns as well.
This was my best attempt:
HighestIncome90s = df.head(1)
HighestIncome90s = df.loc[(df['year'] >= 1990) & (df['year'] <= 1999), 'income'].nlargest()
Let us try to fix your code with sort_values:
df = df.sort_values('income',ascending=False)
HighestIncome90s = df
HighestIncome90s = df.loc[(df['year'] >= 1990) & (df['year'] <= 1999), 'income'].head(1)
If you would like to get all columns:
Allcol = df.loc[(df['year'] >= 1990) & (df['year'] <= 1999),].head(1)
I want it to show all other columns as well.
If you use idxmax, as in
max_income_idx = df.income[(df['year'] >= 1990) & (df['year'] <= 1999)].idxmax()
then max_income_idx will be the index of the row with the largest relevant income. You can then use df.loc[max_income_idx, :] to get all columns for that row, as in the sketch below.
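A minimal sketch putting those two steps together (assuming the frame has 'year', 'income' and 'product' columns as described):
in_90s = (df['year'] >= 1990) & (df['year'] <= 1999)
# index label of the row with the largest income in 1990-1999
max_income_idx = df.loc[in_90s, 'income'].idxmax()
# the full row (product, year, income, ...) for that index
best_row = df.loc[max_income_idx, :]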
As I understand it, your source DataFrame contains income data for each product and year, something like:
year product income
0 1980 P1 120.15
1 1990 P1 120.15
2 1992 P1 140.20
3 1994 P1 160.51
4 1996 P1 171.04
5 1988 P2 140.17
6 1991 P2 145.17
7 1993 P2 160.42
8 1995 P2 181.73
9 1989 P3 140.17
10 1992 P3 175.17
11 1994 P3 240.42
12 1996 P3 315.73
But you are interested only in rows for year between 1990 and 1999.
Then, you want to sum the income for each product over the whole 10-year period.
The code to do it is:
wrk = df.query('year.between(1990,1999)').groupby('product').income.sum()
For the above source data, this gives the following Series:
product
P1 591.90
P2 487.32
P3 731.32
Name: income, dtype: float64
(the left column is the index and the right is the total income for each product).
And to get the final result (the "best-seller" product and the total income
that it brought) run:
result = wrk.sort_values(ascending=False).head(1)
It is also a Series, but containing only one element:
product
P3 731.32
Name: income, dtype: float64
(P3 is the index and 731.32 is the total income).
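Equivalently, if only the product label and its total are needed, idxmax/max on the same Series would do (a small sketch):
best_product = wrk.idxmax()  # 'P3'
best_total = wrk.max()       # 731.32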
All other solutions presented so far give the biggest income for a single year (within the period of interest), not the total income over this period.
First step: this code limits the years to 1990 through 1999:
df = df.query('year >= 1990 & year <= 1999')
Second step: after that, use this code to sort the income from the highest value:
df = df.sort_values('income', ascending=False)
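Presumably the final step is then to take the first row; combined into one chain (a sketch reusing the variable name from the question):
HighestIncome90s = (df.query('year >= 1990 & year <= 1999')
                      .sort_values('income', ascending=False)
                      .head(1))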

Format number based on conditional

I am new to Python and struggling with a simple formatting issue. I have a table with two columns - metrics and value. I am looking to format the value based on the name of the metric (in the metrics column), but can't seem to get it to work. I'd like the numbers to show as #,### and metrics with the name 'Pct ...' to be #.#%. The code runs OK but no changes are made. Also, some of the values may be nulls; I am not sure how to handle that.
# format numbers and percentages
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
for x in pct_options:
    if x in df['metrics']:
        df.value.mul(100).astype('float64').astype(str).add('%')
    else:
        df.value.astype('float64')
IIUC, you can do it with isin, try
# first convert your column to float if necessary; note you need to reassign the column
df.value = df.value.astype('float64')
# then change only the rows with the right metrics, using a mask created with isin
mask_pct = df.metrics.isin(pct_options)
df.loc[mask_pct, 'value'] = df.loc[mask_pct, 'value'].mul(100).astype(str).add('%')
EDIT: here may be what you want:
# example df
import numpy as np
import pandas as pd

df = pd.DataFrame({'metrics': ['val', 'Pct Conversion', 'Pct Gross Churn', 'ind', 'Pct Net Churn'],
                   'value': [12345.5432, 0.23245436, 0.4, 13, 0.000004]})
print (df)
metrics value
0 val 12345.543200
1 Pct Conversion 0.232454
2 Pct Gross Churn 0.400000
3 ind 13.000000
4 Pct Net Churn 0.000004
#change the formatting with np.where
pct_options = ['Pct Conversion', 'Pct Gross Churn', 'Pct Net Churn']
df.value = np.where(df.metrics.isin(pct_options), df.value.mul(100).map('{:.2f}%'.format), df.value.map('{:,.2f}'.format))
metrics value
0 val 12,345.54
1 Pct Conversion 23.25%
2 Pct Gross Churn 40.00%
3 ind 13.00
4 Pct Net Churn 0.00%
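On the nulls mentioned in the question: the np.where line above would turn NaN into the string 'nan'. A minimal sketch that formats only the non-null rows and leaves NaN in place:
is_pct = df.metrics.isin(pct_options)
has_value = df.value.notna()
# format only non-null values; NaNs stay as NaN instead of becoming the string 'nan'
df.loc[is_pct & has_value, 'value'] = df.loc[is_pct & has_value, 'value'].mul(100).map('{:.2f}%'.format)
df.loc[~is_pct & has_value, 'value'] = df.loc[~is_pct & has_value, 'value'].map('{:,.2f}'.format)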

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date 4. close decile
2017-01-03 1158.2 0
2017-01-04 1166.5 1
2017-01-05 1181.4 2
2017-01-06 1175.7 1
... ...
2018-04-23 1326.0 7
2018-04-24 1333.2 8
2018-04-25 1327.2 7
[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.loc[datelist]
level2 = gold.loc[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1, level2, how='inner', left_index=True, right_index=True)

def ret(one, two):
    return (two - one) / one

pricereturns = result.apply(lambda x: ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile. I'm struggling to expand this into a loop over all 10 deciles at once with a clean CSV output.
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
However, what is practically relevant is the number of trading days, i.e. you won't have prices for weekends or holidays. So I'd suggest you shift by a number of rows rather than by a date range, i.e. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1 and 2 directly so you don't need the additional column (this only works if you shift by a number of rows, not by a fixed date range, due to reindexing):
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data, you get something like the dataframe below; close and ret are the averages for each decile. You can create a CSV from a dataframe via pandas.DataFrame.to_csv (see the sketch after the table).
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
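Following up on the to_csv remark, a minimal sketch to write the per-decile averages out (the file name is a placeholder):
# write the per-decile averages to a CSV file
gold_grouped.to_csv('decile_mean_returns.csv')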
