How do I fill missing sequential data in a dataframe? - python

I have a dataset that is formatted like this:
index string
1 2008
1 2009
1 2010
2
2
2
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5
5
5
I would like to fill in the missing data with the same sequence like this:
index string
1 2008
1 2009
1 2010
2 <-2008
2 <-2009
2 <-2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 <-2008
5 <-2009
5 <-2010
So the final result looks like this:
index string
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
I am currently doing this in Excel, which is impractical given the number of rows that need to be filled.
I tried using fillna(method = 'ffill', limit = 2, inplace = True), but this will only fill data with what is in the previous cell. Any help is appreciated.

You can try:
l = [2008, 2009, 2010]
# is the row NaN?
m = df['string'].isna()
# update with 2008, 2009, etc. in a defined order
df.loc[m, 'string'] = (df.groupby('index').cumcount()
                         .map(dict(enumerate(l)))
                       )
# convert dtype if needed
df['string'] = df['string'].convert_dtypes()
Alternatively, just define a start year:
start = 2008
m = df['string'].isna()
df.loc[m, 'string'] = df.groupby('index').cumcount().add(start)
df['string'] = df['string'].convert_dtypes()
Output:
index string
0 1 2008
1 1 2009
2 1 2010
3 2 2008
4 2 2009
5 2 2010
6 3 2008
7 3 2009
8 3 2010
9 4 2008
10 4 2009
11 4 2010
12 5 2008
13 5 2009
14 5 2010
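For readers who want to run the cumcount approach end to end, here is a minimal sketch. The sample frame is a hypothetical reconstruction of the question's data (the column names 'index' and 'string' come from the question; the variable names missing and position are mine), so treat it as an illustration rather than the answerer's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question: groups 2 and 5 have no years.
years = [2008, 2009, 2010]
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4, 5], 3),
    "string": years + [np.nan] * 3 + years * 2 + [np.nan] * 3,
})

missing = df["string"].isna()
# Position of each row inside its group: 0, 1, 2, 0, 1, 2, ...
position = df.groupby("index").cumcount()
# Translate the position into a year; only the missing rows are overwritten.
df.loc[missing, "string"] = position.map(dict(enumerate(years)))
df["string"] = df["string"].astype(int)

print(df["string"].tolist())
```

Because the fill value depends on the row's position within its group, this does not rely on the missing rows arriving in complete blocks of three.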

Try this:
# Find where the values are NaN
m = df['string'].isna()
# Count the NaNs with 'm.sum()'
# Replace the NaNs with [2008, 2009, 2010]*(number_of_NaNs // 3) -> [2008,2009,2010,2008,2009,2010,...]
df.loc[m, 'string'] = [2008, 2009, 2010]*(m.sum()//3)
Output:
string
index
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
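Note that the list-repetition approach only lines up when the number of missing rows is an exact multiple of three and every gap is a complete 2008 to 2010 block. A hedged sketch with an explicit guard follows; the sample data and the error message are my own assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

years = [2008, 2009, 2010]
# Hypothetical sample: groups 2 and 4 are entirely missing.
df = pd.DataFrame({
    "index": np.repeat([1, 2, 3, 4], 3),
    "string": years + [np.nan] * 3 + years + [np.nan] * 3,
})

m = df["string"].isna()
n_missing = int(m.sum())

# Guard: the repeated list must match the number of missing rows exactly,
# otherwise the assignment would fail with a length mismatch.
if n_missing % len(years) != 0:
    raise ValueError("number of missing rows is not a multiple of len(years)")

df.loc[m, "string"] = years * (n_missing // len(years))
df["string"] = df["string"].astype(int)

print(df["string"].tolist())
```

If gaps can be partial (for example only the 2009 row missing in a group), the group-position approach from the other answer is the safer choice.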

Related

How to extract the last year (YYYY) from a YYYY-YY format column in Pandas

I am trying to extract the last year (YY) of a fiscal date string in the format YYYY-YY, e.g. the last year of '1999-00' would be 2000.
Current code seems to cover most cases other than this.
import pandas as pd
import numpy as np
test_df = pd.DataFrame(data={'Season':['1996-97', '1997-98', '1998-99',
'1999-00', '2000-01', '2001-02',
'2002-03','2003-04','2004-05',
'2005-06','2006-07','2007-08',
'2008-09', '2009-10', '2010-11', '2011-12'],
'Height':np.random.randint(20, size=16),
'Weight':np.random.randint(40, size=16)})
I need logic for the case where the season ends at the turn of a century: my apply method should then add 1 to the first two digits. I believe this is the only case I am missing.
Current code is as follows:
test_df['Season'] = test_df['Season'].apply(lambda x: x[0:2] + x[5:7])
This should work too:
pd.to_numeric(test_df['Season'].str.split('-').str[0]) + 1
Output:
0 1997
1 1998
2 1999
3 2000
4 2001
5 2002
6 2003
7 2004
8 2005
9 2006
10 2007
11 2008
12 2009
13 2010
14 2011
15 2012
You can use .str.extract to extract the first four digits
test_df['Season'] = test_df['Season'].str.extract(r'^(\d{4})', expand=False).astype(int).add(1)
Season Height Weight
0 1997 4 22
1 1998 18 4
2 1999 19 27
3 2000 7 10
4 2001 19 9
5 2002 18 31
6 2003 19 9
7 2004 18 29
8 2005 13 17
9 2006 13 30
10 2007 5 14
11 2008 15 3
12 2009 13 10
13 2010 15 8
14 2011 0 23
15 2012 2 38
Here you go! Use the following function instead of the lambda:
def get_season(string):
    century = int(string[:2])
    preyear = int(string[2:4])
    postyear = int(string[5:7])
    if postyear < preyear:
        century += 1
    # zfill is so that "1" becomes "01"
    return str(century).zfill(2) + str(postyear).zfill(2)
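As a quick check, the function can be applied to a few seasons and compared with the "first year plus one" approach from the earlier answers. This is a sketch on a small hypothetical sample (the names seasons, via_function and via_arithmetic are mine); note that get_season returns strings, so the result is cast to int before comparing.

```python
import pandas as pd

def get_season(season: str) -> str:
    """Return the ending year of a 'YYYY-YY' season as a 4-digit string."""
    century = int(season[:2])
    preyear = int(season[2:4])
    postyear = int(season[5:7])
    # The two-digit year wrapped around (e.g. 99 -> 00): next century.
    if postyear < preyear:
        century += 1
    return str(century).zfill(2) + str(postyear).zfill(2)

# Hypothetical sample including the turn-of-century case.
seasons = pd.Series(["1998-99", "1999-00", "2000-01"])

via_function = seasons.apply(get_season).astype(int)
via_arithmetic = pd.to_numeric(seasons.str[:4]) + 1

print(via_function.tolist())    # [1999, 2000, 2001]
print(via_arithmetic.tolist())  # [1999, 2000, 2001]
```

Both give the same years here because every season spans exactly one year; the arithmetic version is shorter, while the function makes the century rollover explicit.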
I use the fiscalyear module.
import numpy as np
import pandas as pd
import fiscalyear as fy
...
test_df['Season'] = test_df['Season'].apply(lambda x : fy.FiscalYear(int(x[0:4]) + 1).fiscal_year)
print(test_df)

How to unpack a list of tuple in various length in a panda dataframe?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
Explode the column, build a dataframe from the resulting tuples, and then join it back:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(), index=s.index)
                       .add_prefix("TUP_"))
       .reset_index(drop=True))  # you can chain a dropna if required
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
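To get the output requested in the question (no row for the ID without tuples), one option is to drop the missing entries before exploding. The following is a sketch on a shortened hypothetical sample; the column names TUP_1 and TUP_2 follow the question, and the intermediate names exploded and parts are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened version of the question's data.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIST_OF_TUPLE": [
        [("2012", "12"), ("2012", "33")],
        np.nan,
        [("2012", "12")],
    ],
})

# Drop IDs with no list, then give every tuple its own row.
exploded = (df.dropna(subset=["LIST_OF_TUPLE"])
              .explode("LIST_OF_TUPLE")
              .reset_index(drop=True))

# Split each tuple into two columns; both frames share the same fresh index.
parts = pd.DataFrame(exploded["LIST_OF_TUPLE"].tolist(),
                     columns=["TUP_1", "TUP_2"])
out = pd.concat([exploded[["ID"]], parts], axis=1)

print(out)
```

Resetting the index before building parts keeps the row labels unique, which avoids alignment problems when the two pieces are concatenated.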

Finding all values in between specific values in data frame

I have this dataframe.
df
name timestamp year
0 A 2004 1995
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
Based on the first two entries in df['timestamp'], I fetch all rows whose df['year'] value falls between those two entries, which in this case is 2004 to 2008.
y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2,inclusive=True )]
movies
name timestamp year
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
This works fine for me. But when the first entry is greater than the second (e.g. 2008 and 2004), the result is empty.
df
name timestamp year
0 A 2008 1995
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
In this case I get nothing back.
Expected outcome:
Whichever of the two values is greater, I should get the in-between values every time.
You could use Series.head and Series.agg:
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
movies = df[df['year'].between(y1, y2,inclusive=True )]
[out]
name timestamp year
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
You can fix that by changing just two lines of code:
y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
This way y1 is always less than or equal to y2.
However, as @ALollz pointed out, you can save both computation and coding time by using
y1,y2 = np.sort(df['timestamp'].head(2))
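Putting the pieces together, here is a self-contained sketch using the second dataframe from the question (the name column is omitted for brevity; low and high are my own variable names). It relies on the default behaviour of between, which includes both bounds, rather than passing inclusive=True; as far as I know, newer pandas releases expect a string such as 'both' for that argument, so leaving it out sidesteps the version difference.

```python
import pandas as pd

# Data from the question where the first timestamp is the larger one.
df = pd.DataFrame({
    "timestamp": [2008, 2004, 2005, 2003, 1995, 2007, 2005, 2009, 2018, 2016],
    "year":      [1995, 2004, 2006, 2007, 2008, 2003, 2001, 2005, 2009, 2018],
})

# Sort the first two timestamps so the lower bound always comes first.
low, high = sorted(df["timestamp"].head(2))

# between() includes both endpoints by default.
movies = df[df["year"].between(low, high)]

print(movies.index.tolist())  # [1, 2, 3, 4, 7]
```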

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
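For reference, pct_change computes current / previous - 1 for each row, which is why the first row is NaN. A small sketch on the first three rows of the question's table shows the equivalence; the 'Growth Rate (%)' column is a hypothetical addition for readability, not something the question asked for.

```python
import pandas as pd

# First three rows of the table from the question.
df = pd.DataFrame({
    "Date": [2001, 2002, 2003],
    "Total Managed Expenditure": [503.2, 529.9, 559.8],
})

col = df["Total Managed Expenditure"]
growth = col.pct_change()

# The same quantity written out by hand: value / previous value - 1.
manual = col / col.shift(1) - 1

df["Growth Rate"] = growth
# Hypothetical extra column: the rate as a percentage, rounded to 2 places.
df["Growth Rate (%)"] = (growth * 100).round(2)

print(df)
```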

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to pass as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe (df1) had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
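A compact variant of the ideas above is to name the value column while resetting the index, using the name argument of Series.reset_index. This sketch assumes the series index is already named 'year', as in the question's printout; the raw frame in the second half is hypothetical, with one row per sighting.

```python
import pandas as pd

# A small series shaped like the question's groupby result.
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# The index becomes the 'year' column; the values get the given name.
out = s.reset_index(name="sightings")
print(out)

# Hypothetical raw data: one row per sighting, counted per year in one chain.
raw = pd.DataFrame({"year": [1997, 1997, 1998]})
counts = raw.groupby("year").size().reset_index(name="sightings")
print(counts)
```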
