Using bfill with a chosen number - python

I have a data frame column like so:
Year Rank
2017 NaN
2017 NaN
2017 3
2017 4
2017 5
.
.
2016 NaN
2016 NaN
2016 3
2016 4
2016 5
.
.
Can I use bfill to replace the first two values so my column looks like this?
Year Rank
2017 1
2017 2
2017 3
2017 4
2017 5
.
.
2016 1
2016 2
2016 3
2016 4
2016 5
.
.
Or is there an easier way than using bfill? Thanks in advance

Use the limit parameter in fillna:
df['Rank'] = df['Rank'].fillna(1, limit=1)
df['Rank'] = df['Rank'].fillna(2, limit=2)
...and, if necessary, call the function per group:
def f(x):
    x = x.fillna(1, limit=1)
    x = x.fillna(2, limit=2)
    return x
df['New'] = df.groupby('Year')['Rank'].apply(f)
print (df)
Year Rank New
0 2017 NaN 1.0
1 2017 NaN 2.0
2 2017 3.0 3.0
3 2017 4.0 4.0
4 2017 5.0 5.0
5 2016 NaN 1.0
6 2016 NaN 2.0
7 2016 5.0 5.0
8 2016 6.0 6.0
9 2016 10.0 10.0

See the documentation for pandas.DataFrame.fillna.
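
If the ranks are simply each row's 1-based position within its year, a per-group counter avoids hard-coding the fill values. A minimal sketch, assuming the sample data above and that NaNs appear only at the top of each year:

import numpy as np
import pandas as pd

# small frame mirroring the question (assumed shape)
df = pd.DataFrame({
    'Year': [2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016],
    'Rank': [np.nan, np.nan, 3, 4, 5, np.nan, np.nan, 3, 4, 5],
})

# cumcount() numbers the rows within each Year starting at 0,
# so cumcount() + 1 fills any number of leading NaNs, not just two
df['Rank'] = df['Rank'].fillna(df.groupby('Year').cumcount() + 1)
print(df)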

Related

Calculate months elapsed since start value in pandas dataframe

I have a dataframe that looks like this:
df = {'CAL_YEAR': [2021,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2023,2023],
      'CAL_MONTH': [12,1,2,3,4,5,6,7,8,9,10,11,12,1,2]}
I want to calculate a months-elapsed column, which should look like this:
df = {'CUM_MONTH':[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]}
How can I do this?
My starting month is 12/2021 or 12/31/2021 (I do not care about dates here, only about the months elapsed). This is economic scenario data, but the source data is not in the format we need.
IIUC:
multiplier = {'CAL_YEAR': 12, 'CAL_MONTH': 1}
df.assign(
    CUM_MONTH=df[multiplier].diff().mul(multiplier).sum(axis=1).cumsum()
)
CAL_YEAR CAL_MONTH CUM_MONTH
0 2021 12 0.0
1 2022 1 1.0
2 2022 2 2.0
3 2022 3 3.0
4 2022 4 4.0
5 2022 5 5.0
6 2022 6 6.0
7 2022 7 7.0
8 2022 8 8.0
9 2022 9 9.0
10 2022 10 10.0
11 2022 11 11.0
12 2022 12 12.0
13 2023 1 13.0
14 2023 2 14.0
I basically did the same as the method above, but in several explicit steps, without using the diff(), sum() and cumsum() functions:
start_year = int(data["VALUATION_DATE"][0][-4:])
data = data.astype({"CAL_YEAR": "int","CAL_MONTH": "int"})
data["CAL_YEAR_ELAPSED"] = data["CAL_YEAR"] - (start_year+1)
data["CumMonths"] = data["CAL_MONTH"] + 12 * data["CAL_YEAR_ELAPSED"] +1

Balancing panel data for regression

I have a dataframe:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30
I want to create a balanced panel such that:
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2009 NaN
5 2 def10 2010 20
6 3 ghk 2008 NaN
7 3 ghk 2009 30
8 3 ghk 2010 NaN
If I use the following code:
df = df.set_index('id')
balanced = (df.set_index('year', append=True)
              .reindex(pd.MultiIndex.from_product(
                  [df.index, range(df.year.min(), df.year.max()+1)],
                  names=['frs_id', 'year']))
              .reset_index(level=1))
it gives me the following error:
cannot handle a non-unique multi-index!
You are close to the solution. You can amend your code slightly as follows:
idx = pd.MultiIndex.from_product([df['id'].unique(),range(df.year.min(),df.year.max()+1)],names=['id','year'])
df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()
df2['city'] = df2.groupby('id')['city'].ffill().bfill()
Changes to your code:
Create the MultiIndex from the unique values of id instead of from the index
Set the index on both id and year before reindex()
Fill in the NaN values of column city with the non-NaN entries of the same id
Result:
print(df2)
id year city value
0 1 2008 abc 10.0
1 1 2009 abc 20.0
2 1 2010 abc 30.0
3 2 2008 def10 10.0
4 2 2009 def10 NaN
5 2 2010 def10 20.0
6 3 2008 ghk NaN
7 3 2009 ghk 30.0
8 3 2010 ghk NaN
Optionally, you can re-arrange the column sequence, if you like:
df2.insert(2, 'year', df2.pop('year'))
print(df2)
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit
You can also do it using stack() and unstack() without using reindex(), as follows:
(df.set_index(['id', 'city', 'year'], append=True)
   .unstack()
   .groupby(level=[1, 2]).max()
   .stack(dropna=False)
).reset_index()
Output:
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Pivot the table and stack year without dropping NaN values:
>>> df.pivot(index=["id", "city"], columns="year", values="value") \
       .stack(dropna=False) \
       .rename("value") \
       .reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit: case of duplicate entries
I slightly modified your original dataframe:
df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30 # The problem is here
6 3 ghk 2009 40 # same (id, city, year)
You need to make a decision: do you want to keep row 5 or row 6, or apply an aggregation function (mean, sum, ...)? Imagine you want the mean for (3, ghk, 2009):
>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 35.0 # <- mean of (30, 40)
8 3 ghk 2010 NaN
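
If you would rather keep one of the duplicated rows instead of aggregating, a sketch that drops duplicates before pivoting, assuming the first occurrence should win:

# keep row 5 (the first (3, ghk, 2009) entry), then pivot/stack as before
(df.drop_duplicates(subset=["id", "city", "year"], keep="first")
   .pivot(index=["id", "city"], columns="year", values="value")
   .stack(dropna=False)
   .rename("value")
   .reset_index())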

Fill Pandas dataframe rows whose value is 0 or NaN with a formula that has to be calculated on specific rows of another column

I have a dataframe where the values in the "price" column differ depending on both the "quantity" and "year" columns. For example, for a quantity equal to 2 I have a price equal to 2 in 2017 and equal to 4 in 2018. I would like to fill the 2019 rows, which have a 0 or NaN value, with the values from 2018.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,np.nan,np.nan,0,0,np.nan,0,np.nan,0,np.nan])
})
And what if, instead of taking the values from 2018, I needed to calculate the mean of 2017 and 2018?
I tried to adapt this question to the first case (filling with the 2018 data), but it doesn't work:
df['price'][df['year']==2019].fillna(df['price'][df['year'] == 2018], inplace = True)
Could you please help me?
The expected output should be a dataframe like the following.
Df with values from 2018
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,2,4,6,8,10,12,14,16,18])
})
Df with values that are a mean between 2017 and 2018
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,1.5,3,4.5,6,7.5,9,10.5,12,13.5])
})
Here's one way filling with the mean of 2017 and 2018.
Start by grouping the 2017 and 2018 data by quantity and aggregating with the mean:
m = df[df.year.isin([2017, 2018])].groupby('quantity').price.mean()
Then use set_index to set the quantity column as the index on the 2019 rows, replace 0s with NaNs, and use fillna, which also accepts a Series to map the values according to the index:
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                         .replace(0, np.nan).fillna(m).values)
quantity year price
0 1 2017 1.0
1 2 2017 2.0
2 3 2017 3.0
3 4 2017 4.0
4 5 2017 5.0
5 6 2017 6.0
6 7 2017 7.0
7 8 2017 8.0
8 9 2017 9.0
9 1 2018 2.0
10 2 2018 4.0
11 3 2018 6.0
12 4 2018 8.0
13 5 2018 10.0
14 6 2018 12.0
15 7 2018 14.0
16 8 2018 16.0
17 9 2018 18.0
18 1 2019 1.5
19 2 2019 3.0
20 3 2019 4.5
21 4 2019 6.0
22 5 2019 7.5
23 6 2019 9.0
24 7 2019 10.5
25 8 2019 12.0
26 9 2019 13.5
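
For the first expected output (carrying the 2018 prices into the 2019 rows), the same pattern works with a mapping built from the 2018 rows only. A sketch starting from the original df, with numpy imported as np as in the question:

# price per quantity in 2018 only
m2018 = df[df.year.eq(2018)].set_index('quantity').price

# fill the 2019 rows, aligning on quantity
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                         .replace(0, np.nan).fillna(m2018).values)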

Reformatting a dataframe into a new output format

I have the output from a pivot table in a dataframe (df) that looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='Year', columns='Month', values='sum').
To fill in an empty Year column (if it is empty), forward-fill it first, e.g. df['Year'] = df['Year'].ffill(), before the above.
Note that, unlike the answer above, this suggestion applies when Year and Month are columns rather than the index.
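
pivot_table is another option; it also tolerates duplicate Year/Month pairs. A minimal sketch, assuming df still carries the Year/Month MultiIndex shown in the question (hence the reset_index()) and that any duplicates should be summed:

# bring Year/Month back to columns, then pivot to the wide Year x Month layout
wide = (df.reset_index()
          .pivot_table(index='Year', columns='Month', values='sum', aggfunc='sum'))
print(wide)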

Expand panel data python

I am trying to expand the following data. I am a Stata user, and my problem can be fixed by the "fillin" command in Stata; now I am trying to reproduce this command in Python and couldn't find anything that works.
For example, transform this data frame
(my dataframe is bigger than the example given; the example is just to illustrate what I want to do)
id year X Y
1 2008 10 20
1 2010 15 25
2 2011 2 4
2 2012 3 6
to this one
id year X Y
1 2008 10 20
1 2009 . .
1 2010 15 25
1 2011 . .
1 2012 . .
2 2008 . .
2 2009 . .
2 2010 . .
2 2011 2 4
2 2012 3 6
Thank you, and sorry for my English.
This can be done by building all combinations of the index and reindexing:
from itertools import product
import pandas as pd

df = pd.DataFrame([[1, 2008, 10, 20], [1, 2010, 15, 25], [2, 2011, 2, 4], [2, 2012, 3, 6]],
                  columns=['id', 'year', 'X', 'Y'])
df = df.set_index(['id', 'year'])
# All combinations of the index
# idx = list(product(df.index.levels[0], df.index.levels[1]))
idx = list(product(range(1, 3), range(2008, 2013)))
df.reindex(idx)  # .loc[idx] would raise on missing labels in current pandas, so use reindex
Create a new MultiIndex from the dataframe and then reindex:
import numpy as np

years = np.tile(np.arange(df.year.min(), df.year.max()+1, 1), 2)
ids = np.repeat(df.id.unique(), df.year.max()-df.year.min()+1)
arrays = [ids.tolist(), years.tolist()]
new_idx = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['id', 'year'])
df = df.set_index(['id', 'year'])
df.reindex(new_idx).reset_index()
id year X Y
0 1 2008 10.0 20.0
1 1 2009 NaN NaN
2 1 2010 15.0 25.0
3 1 2011 NaN NaN
4 1 2012 NaN NaN
5 2 2008 NaN NaN
6 2 2009 NaN NaN
7 2 2010 NaN NaN
8 2 2011 2.0 4.0
9 2 2012 3.0 6.0
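
A more general equivalent of Stata's fillin, which avoids hard-coding the number of ids or the year range, is to reindex over the product of the unique ids and the observed year span. A minimal sketch using the sample frame above:

import pandas as pd

df = pd.DataFrame([[1, 2008, 10, 20], [1, 2010, 15, 25],
                   [2, 2011, 2, 4], [2, 2012, 3, 6]],
                  columns=['id', 'year', 'X', 'Y'])

# every (id, year) combination, taken from the data itself
full_idx = pd.MultiIndex.from_product(
    [df['id'].unique(), range(df['year'].min(), df['year'].max() + 1)],
    names=['id', 'year'])

balanced = df.set_index(['id', 'year']).reindex(full_idx).reset_index()
print(balanced)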
