Python / Pandas: Fill NaN in order: linear interpolation --> ffill --> bfill

I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the nan values in a particular order: linearly interpolate first, then forward fill, and then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]

def ffbf(x):
    return x.ffill().bfill()

group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving it? Thanking you in advance.

I believe you first need to convert the revenues column to floats. Either parse the thousands separators while reading:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()
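Putting it together, a minimal self-contained sketch (assuming the question's column layout and a hypothetical data.csv) that parses the thousands separators and then imputes per company in the stated order:

import pandas as pd

df = pd.read_csv('data.csv', thousands=',')  # hypothetical file name

def ffbf(x):
    # 1. linear interpolation, 2. forward fill, 3. backward fill
    return x.interpolate().ffill().bfill()

# transform returns a same-length Series, so it assigns back cleanly
df['revenues'] = df.groupby('company')['revenues'].transform(ffbf)

Note that interpolate() alone cannot fill a leading NaN (there is no earlier value to interpolate from), which is exactly what the trailing bfill() handles.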

Related

How to join different dataframe with specific criteria?

In my MySQL database stocks, I have 5 different tables. I want to join all of those tables to display the EXACT format that I want to see. Should I join in MySQL first, or should I first extract each table as a dataframe and then join with pandas? How should it be done? I also don't know the code.
This is how I want to display: https://www.dropbox.com/s/uv1iik6m0u23gxp/ExpectedoutputFULL.csv?dl=0
So each ticker is a row that contains all of the specific columns from my tables.
Additional info:
I only need the most recent 8 quarters for quarterly and 5 years for yearly to be displayed
The exact dates for quarterly data may differ between tickers. If done by hand, the most recent eight quarters can easily be copied and pasted into the respective columns, but I have no idea how to make a computer determine which quarter a date belongs to and show it in the same column as in my example output. (I use the terms q1 through q8 simply as column names to display; so if my most recent data is May 30, q8 is not necessarily the final quarter of the second year.)
If the most recent quarter or year for one ticker is not available (as in "ADUS" in the example), but it is available for other tickers such as "BA" in the example, simply leave that one blank.
1st table company_info: https://www.dropbox.com/s/g95tkczviu84pnz/company_info.csv?dl=0 contains company info data.
2nd table income_statement_q: https://www.dropbox.com/s/znf3ljlz4y24x7u/income_statement_q.csv?dl=0 contains quarterly data.
3rd table income_statement_y: https://www.dropbox.com/s/zpq79p8lbayqrzn/income_statement_y.csv?dl=0 contains yearly data.
4th table earnings_q: https://www.dropbox.com/s/bufh7c2jq7veie9/earnings_q.csv?dl=0 contains quarterly data.
5th table earnings_y: https://www.dropbox.com/s/li0r5n7mwpq28as/earnings_y.csv?dl=0 contains yearly data.
You can use:
# Convert to datetime64 if necessary
df2['date'] = pd.to_datetime(df2['date'])  # quarterly
df3['date'] = pd.to_datetime(df3['date'])  # yearly

# Realign dates to the period end: 2022-06-30 -> 2022-12-31 for yearly
df2['date'] += pd.offsets.QuarterEnd(0)
df3['date'] += pd.offsets.YearEnd(0)

# Get end dates
qmax = df2['date'].max()
ymax = df3['date'].max()

# Create date ranges (8 periods for Q, 5 periods for Y)
qdti = pd.date_range(qmax - pd.offsets.QuarterEnd(7), qmax, freq='Q')
ydti = pd.date_range(ymax - pd.offsets.YearEnd(4), ymax, freq='Y')

# Filter and reshape dataframes
qdf = (df2[df2['date'].isin(qdti)]
           .assign(date=lambda x: x['date'].dt.to_period('Q').astype(str))
           .pivot(index='ticker', columns='date', values='netIncome'))
ydf = (df3[df3['date'].isin(ydti)]
           .assign(date=lambda x: x['date'].dt.to_period('Y').astype(str))
           .pivot(index='ticker', columns='date', values='netIncome'))

# Create the expected dataframe
out = pd.concat([df1.set_index('ticker'), qdf, ydf], axis=1).reset_index()
Output:
>>> out
ticker industry sector pe roe shares ... 2022Q4 2018 2019 2020 2021 2022
0 ADUS Health Care Providers & Services Health Care 38.06 7.56 16110400 ... NaN 1.737700e+07 2.581100e+07 3.313300e+07 4.512600e+07 NaN
1 BA Aerospace & Defense Industrials NaN 0.00 598240000 ... -663000000.0 1.046000e+10 -6.360000e+08 -1.194100e+10 -4.290000e+09 -5.053000e+09
2 CAH Health Care Providers & Services Health Care NaN 0.00 257639000 ... -130000000.0 2.590000e+08 1.365000e+09 -3.691000e+09 6.120000e+08 -9.320000e+08
3 CVRX Health Care Equipment & Supplies Health Care 0.26 -32.50 20633700 ... -10536000.0 NaN NaN NaN -4.307800e+07 -4.142800e+07
4 IMCR Biotechnology Health Care NaN -22.30 47905000 ... NaN -7.163000e+07 -1.039310e+08 -7.409300e+07 -1.315230e+08 NaN
5 NVEC Semiconductors & Semiconductor Equipment Information Technology 20.09 28.10 4830800 ... 4231324.0 1.391267e+07 1.450794e+07 1.452664e+07 1.169438e+07 1.450750e+07
6 PEPG Biotechnology Health Care NaN -36.80 23631900 ... NaN NaN NaN -1.889000e+06 -2.728100e+07 NaN
7 VRDN Biotechnology Health Care NaN -36.80 40248200 ... NaN -2.210300e+07 -2.877300e+07 -1.279150e+08 -5.501300e+07 NaN
[8 rows x 20 columns]
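On the MySQL-vs-pandas part of the question: one option is to pull each table into its own DataFrame first and do the reshaping in pandas as above. A sketch, assuming a SQLAlchemy engine with hypothetical credentials and the table names from the question:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; substitute your own credentials
engine = create_engine('mysql+pymysql://user:password@localhost/stocks')
df1 = pd.read_sql('SELECT * FROM company_info', engine)
df2 = pd.read_sql('SELECT * FROM income_statement_q', engine)
df3 = pd.read_sql('SELECT * FROM income_statement_y', engine)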

Count ratios conditional on 2 columns

I am new to pandas and trying to figure out how to calculate the percentage change (difference) between 2 years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, with Pandas, reaching for a for loop is usually a sign that there is an easier way to accomplish the goal. Here you can use groupby and pct_change to compute the ratio within each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort first to ensure the dates are in the correct order; fillna then replaces any NaNs with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
Apply an anonymous function that calculates the percentage change and returns it if there is more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object

Python function to iterate over a column and calculate the formula

I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform a calculation like
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1
over the VALUE column. What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod()-1
#-26714.522733572892
If you want cumulative product to create a new column use Series.cumprod:
df['new_column']=df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
I think you're after (note the parentheses: add 1 to each value before taking the cumulative product)...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
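For a quick sanity check, here is a self-contained version that reproduces the numbers from the accepted answer above:

import pandas as pd

df = pd.DataFrame({'VALUE': [59.507, 26.03, 6.489, -3.181]})

total = df['VALUE'].add(1).prod() - 1          # compound total
running = df['VALUE'].add(1).cumprod().sub(1)  # running compound total

print(total)    # -26714.522733572892
print(running)  # 59.507, 1634.504..., 12247.291..., -26714.522...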

Dynamic Column Data In Pandas Dataframe (Populated By Dataframe)

I am really stuck on how to approach adding columns to Pandas dataframes dynamically. I've been searching for an answer to work through this; however, I am afraid I may be using the wrong terminology to describe what I am attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is I need a count of the types of actions grouped within particular date ranges.
My DESIRED output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. some type of for loop iterating through, 2. adding a column to the dataframe based on the list and then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in pandas.
I have explained more of my thought process in this question, but I am not sure whether the question is unclear, which may be why it is unanswered:
Building a dataframe with dynamic date ranges using filtered results from another dataframe
There are quite a few concepts you need at once here.
First, you don't yet have the counts. From your desired output I gathered that you want them yearly, but you can specify any time frame you want. Then just count with groupby() and count():
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")
Out[66]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze() and rename() are there because afterwards both the count column and the year would be called date and you get a name conflict. You could equivalently use rename(columns={'date': 'count'}) and not cast to a Series.
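For reference, that equivalent spelling (a sketch) would be:

# Keep the DataFrame and rename the leftover 'date' count column
# instead of squeezing to a Series
df2 = (df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"])
         .count()
         .rename(columns={"date": "count"}))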
The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.
In [62]: df2.pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN is internally represented as floating point, your counts were also converted to floating point. To fix that, just append fillna and convert back to int.
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly your output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore, the column ["2017", "other"] was dropped because there was no value. If this creates problems, you need to include the values beforehand; after the first step, a reindex and a fillna should do the trick.
EDIT: Yes it does
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
...:
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [79]: df3.pivot_table(index="department", columns=["date", "action"])
Out[79]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
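To turn the bare years into the question's date-range headers, a sketch (assuming calendar-year buckets) using strftime to build the label before grouping:

# Build labels like "01/01/2017-12/31/2017" from each row's year,
# then group on this column instead of the bare year
d = pd.to_datetime(df['date'])
df['range'] = d.dt.strftime('01/01/%Y') + '-' + d.dt.strftime('12/31/%Y')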

Operation across observations and years returns NaN

I have a panel dataset with a set of countries [Italy and US] for 3 years and two numeric variables ['Var1', 'Var2']. I would like to calculate the rate of change over the last three years, e.g. the value of Var1 in 2019 minus the value of Var1 in 2017, divided by Var1 in 2017.
I do not understand why my code (below) returns NaN.
data = {'Year':[2017, 2018, 2019, 2017, 2018, 2019], 'Country':['Italy', 'Italy', 'Italy', 'US' , 'US', 'US'], 'Var1':[23,75,45, 32,13,14], 'Var2':[21,75,47, 30,11,18]}
trend = pd.DataFrame(data)
list = ['Var1', 'Var2']
for col in list:
    trend[col + ' (3 Year % Change)'] = ((trend.loc[trend['Year']==2019][col] - trend.loc[trend['Year']==2017][col]) / trend.loc[trend['Year']==2017][col]) * 100
trend
Check if this gives what you want. It is much simpler to understand.
trend['Var1_3_Year_%_Change'] = trend.groupby('Country')['Var1'].apply(lambda x : ((x-x.iloc[0]))/x.iloc[0]*100)
trend['Var2_3_Year_%_Change'] = trend.groupby('Country')['Var2'].apply(lambda x : ((x-x.iloc[0]))/x.iloc[0]*100)
trend['Var1_yearly'] = trend.groupby('Country')['Var1'].apply(lambda x : ((x-x.shift()))/x.shift()*100)
trend['Var2_yearly'] = trend.groupby('Country')['Var2'].apply(lambda x : ((x-x.shift()))/x.shift()*100)
Output
Year Country Var1 Var2 Var1_3_Year_%_Change Var2_3_Year_%_Change Var1_yearly Var2_yearly
2017 Italy 23 21 0.000000 0.000000 NaN NaN
2018 Italy 75 75 226.086957 257.142857 226.086957 257.142857
2019 Italy 45 47 95.652174 123.809524 -40.000000 -37.333333
2017 US 32 30 0.000000 0.000000 NaN NaN
2018 US 13 11 -59.375000 -63.333333 -59.375000 -63.333333
2019 US 14 18 -56.250000 -40.000000 7.692308 63.636364
If it has to be done with a for loop, use:
var = ['Var1', 'Var2']
for col in var:
    trend[col + ' (3 Year % Change)'] = trend.groupby('Country')[col].apply(lambda x: ((x - x.iloc[0])) / x.iloc[0] * 100)
There are a few things going wrong here with your code:
You are trying to divide pd.Series objects, not just their underlying arrays. Each Series carries its index, and because the 2019 rows and the 2017 rows have different indexes, the division is misaligned and produces NaN.
If you actually pass the values, for instance by using .values after the column filters, you will bump into a ValueError, because you are trying to insert two values into the entire DataFrame and pandas won't like that (the lengths should match). This exemplifies it:
trend.loc['Var1' + ' (3 Year % Change)'] = ((trend.loc[trend['Year']==2019, 'Var1'].values -
                                             trend.loc[trend['Year']==2017, 'Var1'].values) /
                                            trend.loc[trend['Year']==2017, 'Var1'].values) * 100
ValueError: cannot set a row with mismatched columns
Not sure if you are using list as an actual variable name, but that shadows the built-in Python list type, which is not the best idea. You can read about it here
If you want to compare the values with 2017 values in your sample, you can use
groupby+shift, based on how many years to shift:
for col in ['Var1','Var2']:
    trend[col + ' (3 Year % Change)'] = (trend[col] - trend.groupby('Country').shift(2)[col]) / trend.groupby('Country').shift(2)[col]
Out[1]:
Year Country Var1 Var2 Var1 (3 Year % Change) Var2 (3 Year % Change)
0 2017 Italy 23 21 NaN NaN
1 2018 Italy 75 75 NaN NaN
2 2019 Italy 45 47 0.956522 1.238095
3 2017 US 32 30 NaN NaN
4 2018 US 13 11 NaN NaN
5 2019 US 14 18 -0.562500 -0.400000
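As a side note, the groupby+shift arithmetic above can also be written with pct_change, which computes (x - x.shift(periods)) / x.shift(periods) directly:

# Equivalent to the groupby+shift version (result as a fraction, not a percentage)
for col in ['Var1', 'Var2']:
    trend[col + ' (3 Year % Change)'] = trend.groupby('Country')[col].pct_change(periods=2)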
