Loop through timeseries and fill missing data - Python


I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, ID '1' is missing values for 2010 and 2012; ID '2' is missing 2007, 2009 and 2015; and ID '3' is missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the code below so far; however, it only fills the gaps for one ID, and I was struggling to find a way to loop through each ID, adding a 'value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.

I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd
df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
                    'Year': [2007,2010,2020,2007,2010,2015],
                    'Value': [1,None,None,None,1,None]})
# Write a function with your logic
def func(x, y):
    # Replace a missing value with 0, but only for years 2007 to 2015
    return 0 if math.isnan(y) and 2007<=x<=2015 else y
# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0

Answering my own question :). I needed to apply a lambda function after the groupby(['Org']) that adds a NaN row for each year that is missing. The reset_index effectively ungroups the result back into the original flat layout.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
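For readers who want to fill every ID in one step, here is a minimal sketch of a different technique (reindexing against a full ID x Year grid) rather than the groupby approach above. It assumes the column names ID, Year and Value from the question, that Year holds plain integers rather than datetimes, that each (ID, Year) pair appears at most once, and that the target range is 2007 to 2015. The names full_index and filled are made up for illustration.

```python
import pandas as pd

# Shortened version of the question's data: some years are missing per ID.
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3],
    "Year":  [2007, 2008, 2011, 2008, 2010, 2009],
    "Value": [1, 1, 1, 1, 1, 1],
})

# Build every (ID, Year) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(2007, 2016)], names=["ID", "Year"]
)

# Reindex against the full grid; rows that did not exist get Value = 1.
filled = (
    df.set_index(["ID", "Year"])
      .reindex(full_index, fill_value=1)
      .reset_index()
)
print(filled)
```

Using fill_value keeps the Value column as integers; omitting it would leave NaN in the new rows, which could then be filled differently per use case.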

Related

pivot dataframe using columns and values

I have a data frame like
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year',values='X'), but the result is not what I expected.

Try passing index in pivot():
out=df.pivot(columns='Year',values='X',index='Date')
#If needed use:
out=out.rename_axis(index=None,columns=None)
OR
Try via agg() and dropna():
out=df.pivot(columns='Year',values='X').agg(sorted,key=pd.isnull).dropna(how='all')
#If needed use:
out.columns.names=[None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
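As a self-contained check of the first suggestion, the sketch below rebuilds the question's data and pivots it with Date as the index. The X values are copied from the question; the Y column is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018],
    "Month": [5] * 11,
    "Date":  [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2],
    "X": [0.21120733, 0.36878636, 0.27969632,
          -1.2968733, -1.1575716, -1.0049003,
          -1.5630698, -1.70889, -1.8548334,
          -7.94e-02, -2.91e-02],
})

# One row per Date, one column per Year; combinations that are
# absent from the data (Date 3 in 2018) become NaN.
out = df.pivot(index="Date", columns="Year", values="X")
print(out)
```

Without index='Date', pivot keeps the original row labels as the index, so each value lands on its own row and the rest of that row is NaN, which is likely the unexpected result described in the question.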

Changing an existing column conditional on two other column

I have a data set:
ID Fv_year HP_b_year HP_e_year
1 2010 0 2012
2 0 2009 2011
3 2000 0 2008
4 2001 0 0
I want to generate:
ID Fv_year HP_b_year HP_e_year
1 2010 2010 2012
2 0 2009 2011
3 2000 2000 2008
4 2001 0 0
In words: when Fv_year > 0, HP_b_year = 0 and HP_e_year > 0, I want to set HP_b_year = Fv_year; otherwise HP_b_year should stay as it was. I have used the following code:
def myfunc(x,y,z):
    if x == 0 and y>0 and z>0:
        return y
    else:
        return x
df['HP_b_year'] = df.apply(lambda x: myfunc(x.HP_b_year, x.Fv_year, x.HP_e_year), axis=1)
But it's not working.

You can use loc with conditions
df.loc[(df['HP_e_year']>0) & (df['Fv_year'].ne(0)), ['HP_b_year']] = df['Fv_year'][(df['HP_e_year']>0) & (df['Fv_year'].ne(0))]
ID Fv_year HP_b_year HP_e_year
0 1 2010 2010 2012
1 2 0 2009 2011
2 3 2000 2000 2008
3 4 2001 0 0
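The loc line above tests only two of the three conditions stated in the question; it gives the expected result on the sample because no row has a nonzero HP_b_year together with a positive Fv_year and HP_e_year. A sketch that spells out all three conditions, with mask as an illustrative name, might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":        [1, 2, 3, 4],
    "Fv_year":   [2010, 0, 2000, 2001],
    "HP_b_year": [0, 2009, 0, 0],
    "HP_e_year": [2012, 2011, 2008, 0],
})

# All three conditions from the question, combined element-wise.
mask = (df["Fv_year"] > 0) & (df["HP_b_year"] == 0) & (df["HP_e_year"] > 0)

# Overwrite HP_b_year only on the rows where the mask is True.
df.loc[mask, "HP_b_year"] = df.loc[mask, "Fv_year"]
print(df)
```

Storing the mask in a variable avoids repeating the condition on both sides of the assignment.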

Filter Dates in Pandas

Currently have a dataset structured the following way:
id_number start_date end_date data1 data2 data3 ...
Basically, I have a whole bunch of id's with a certain date range and then multiple columns of summary data. My problem is that I need yearly totals of the summary data. This means I need to get to a place where I can groupby year on a single occurrence of each document. However, it is not guaranteed that a document exists for a given year, and the date ranges can span multiple years. Any help would be greatly appreciated, I am quite stuck.
Sample dataframe:
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")

Assuming we have a DataFrame df:
id_num start end value
0 1 2002-03-10 2005-04-12 1
1 1 2005-04-13 2005-05-20 2
2 1 2007-05-21 2009-08-10 3
3 2 2012-02-20 2015-02-20 4
4 3 2003-10-19 2012-12-12 5
we can create a row for each year for our start to end ranges with:
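import numpy as np
import pandas as pd
# One array of years per row, from the start year to the end year inclusive
ys = [np.arange(x[0], x[1]+1) for x in zip(df['start'].dt.year, df['end'].dt.year)]
df = (pd.DataFrame(ys, df.index)
      .stack()
      .astype(int)
      .reset_index(level=1, drop=True)  # keyword arguments are required in pandas 2.x
      .to_frame('year')
      .join(df, how='left')
      .reset_index())
print(df)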
Here we're first creating the ys variable with the list of years for each start-end range from our DataFrame, and the df = ... is splitting these year lists into separate rows and joining back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).
Output:
index year id_num start end value
0 0 2002 1 2002-03-10 2005-04-12 1
1 0 2003 1 2002-03-10 2005-04-12 1
2 0 2004 1 2002-03-10 2005-04-12 1
3 0 2005 1 2002-03-10 2005-04-12 1
4 1 2005 1 2005-04-13 2005-05-20 2
5 2 2007 1 2007-05-21 2009-08-10 3
6 2 2008 1 2007-05-21 2009-08-10 3
7 2 2009 1 2007-05-21 2009-08-10 3
8 3 2012 2 2012-02-20 2015-02-20 4
9 3 2013 2 2012-02-20 2015-02-20 4
10 3 2014 2 2012-02-20 2015-02-20 4
11 3 2015 2 2012-02-20 2015-02-20 4
12 4 2003 3 2003-10-19 2012-12-12 5
13 4 2004 3 2003-10-19 2012-12-12 5
14 4 2005 3 2003-10-19 2012-12-12 5
15 4 2006 3 2003-10-19 2012-12-12 5
16 4 2007 3 2003-10-19 2012-12-12 5
17 4 2008 3 2003-10-19 2012-12-12 5
18 4 2009 3 2003-10-19 2012-12-12 5
19 4 2010 3 2003-10-19 2012-12-12 5
20 4 2011 3 2003-10-19 2012-12-12 5
21 4 2012 3 2003-10-19 2012-12-12 5
Note:
I changed the original ranges to test cases where there are some years missing for some id_num, e.g. for id_num=1 we have years 2002-2005, 2005-2005 and 2007-2009, so we should not get 2006 for id_num=1 in the output (and we don't, so it passes the test)
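On pandas 0.25 or later, the same one-row-per-year expansion can also be sketched with DataFrame.explode. This is a different technique from the stack/join chain above, shown on the same modified sample data; per_year is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "id_num": [1, 1, 1, 2, 3],
    "start": pd.to_datetime(["2002-03-10", "2005-04-13", "2007-05-21",
                             "2012-02-20", "2003-10-19"]),
    "end": pd.to_datetime(["2005-04-12", "2005-05-20", "2009-08-10",
                           "2015-02-20", "2012-12-12"]),
    "value": [1, 2, 3, 4, 5],
})

# Each row gets the list of years its date range touches.
df["year"] = [list(range(s, e + 1))
              for s, e in zip(df["start"].dt.year, df["end"].dt.year)]

# explode turns each list element into its own row, repeating the other columns.
per_year = df.explode("year").astype({"year": int}).reset_index(drop=True)
print(per_year)
```

From here, a groupby on id_num and year (or on year alone) would give the yearly totals asked for in the question.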

I've taken your example and added some random values so we have something to work with:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")
np.random.seed(0) # seeding the random values for reproducibility
df['value'] = np.random.random(len(df))
So far we have:
id_num start end value
0 1 2002-03-10 2005-04-12 0.548814
1 1 2005-04-13 2005-05-20 0.715189
2 1 2005-05-21 2009-08-10 0.602763
3 2 2012-02-20 2015-02-20 0.544883
4 3 2003-10-19 2012-12-12 0.423655
We want values at the end of the year for each given date, whether it is beginning or end. So we will treat all dates the same. We just want date + user + value:
tmp = df[['end', 'value']].copy()
tmp = tmp.rename(columns={'end':'start'})
new = pd.concat([df[['start', 'value']], tmp], sort=True)
new['id_num'] = pd.concat([df.id_num, df.id_num]) # doubling the id numbers (Series.append was removed in pandas 2.0)
Giving us:
start value id_num
0 2002-03-10 0.548814 1
1 2005-04-13 0.715189 1
2 2005-05-21 0.602763 1
3 2012-02-20 0.544883 2
4 2003-10-19 0.423655 3
0 2005-04-12 0.548814 1
1 2005-05-20 0.715189 1
2 2009-08-10 0.602763 1
3 2015-02-20 0.544883 2
4 2012-12-12 0.423655 3
Now we can group by ID number and year:
new = new.groupby(['id_num', new.start.dt.year])[['value']].sum().reset_index(0).sort_index() # select 'value' so the datetime column is not summed
id_num value
start
2002 1 0.548814
2003 3 0.423655
2005 1 2.581956
2009 1 0.602763
2012 2 0.544883
2012 3 0.423655
2015 2 0.544883
And finally, for each user we expand the range to have every year in between, filling forward missing data:
new = new.groupby('id_num').apply(lambda x: x.reindex(pd.RangeIndex(x.index.min(), x.index.max() + 1)).ffill()).drop(columns='id_num')
value
id_num
1 2002 0.548814
2003 0.548814
2004 0.548814
2005 2.581956
2006 2.581956
2007 2.581956
2008 2.581956
2009 0.602763
2 2012 0.544883
2013 0.544883
2014 0.544883
2015 0.544883
3 2003 0.423655
2004 0.423655
2005 0.423655
2006 0.423655
2007 0.423655
2008 0.423655
2009 0.423655
2010 0.423655
2011 0.423655
2012 0.423655

ranking transactions trend for each customer per year

I am working in Jupyter. My dataframe has the number of transactions per customer per year and a field that indicates the trend: "up" for more transactions than last year, "down" for fewer transactions than last year, and null for the first year.
I want to create a counter that, for each customer, is raised by 1 for every "up" and reduced by 1 for every "down".
I understand that I first need to sort the df and then build a loop that runs over the customers, with an inner loop that runs over every year, but I need help.
DF SAMPLE:
df = pd.DataFrame({
    'group number': [1,1,1,1,3,3,3],
    'year': ['2012','2013','2014','2015','2011','2012','2013'],
    'trend': [np.nan,'down','up','up',np.nan,'down','up']
})
this is what I did so far:
df =pd.read_excel('totals_new.xlsx',sheet_name='Sheet1').sort_values(['group number', 'year'])
noofgroups = len(df['group number'].unique())
yearspergroup = df.groupby('group number')['year'].nunique()
vtrend =0
for i in noofgroups:
    for j in yearspergroup:
        if df["trend"] == "up":
            vtrend = vtrend+1
        if df["trend"] == "down":
            vtrend = vtrend-1

IIUC, you can use nested np.where() to convert your trend column and then perform a groupby() and agg(). Take this sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'group number': [1,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,2,2,1,2,1,2],
'year': ['2017','2016','2018','2017','2016','2018','2017','2016','2018','2017','2016','2018',
'2017','2016','2018','2017','2016','2018','2017','2016','2018','2017'],
'trend': ['up','down','up',np.nan,'up','down',np.nan,'up','up','up','down',
'up',np.nan,'up','up','up','down','up','up','up',np.nan,'down']
})
Yields:
group number year trend
0 1 2017 up
1 1 2016 down
2 1 2018 up
3 1 2017 NaN
4 1 2016 up
5 1 2018 down
6 1 2017 NaN
7 2 2016 up
8 2 2018 up
9 2 2017 up
10 2 2016 down
11 2 2018 up
12 2 2017 NaN
13 1 2016 up
14 1 2018 up
15 1 2017 up
16 2 2016 down
17 2 2018 up
18 1 2017 up
19 2 2016 up
20 1 2018 NaN
21 2 2017 down
Then:
df['trend'] = np.where(df['trend']=='up', 1, np.where(df['trend']=='down', -1, 0))
df.groupby(['group number','year']).agg({'trend': 'sum'})
Returns:
trend
group number year
1 2016 1
2017 3
2018 1
2 2016 0
2017 0
2018 3

This case is probably closed by now, but here's a possible solution, since the question did not reach a conclusion previously.
import numpy as np
import pandas as pd
"""
In this case, the original dataframe is already properly sorted by group number and year.
If it isn't, the 2 columns should be sorted first
"""
df = pd.DataFrame({
    'group number': [1,1,1,1,3,3,3],
    'year': ['2012','2013','2014','2015','2011','2012','2013'],
    'trend': [np.nan,'down','up','up', np.nan,'down','up']
})
# Map 'down' to -1 and 'up' to 1; rows with a missing trend stay NaN
df['trend_val'] = df.loc[df['trend'].notna(), 'trend'].map(lambda x: -1 if x == 'down' else 1)
# Assign the result back so the cumulative column is kept
df = df.join(df.groupby('group number')['trend_val'].cumsum(), rsuffix='_cumulative')
>>> df
group number year trend trend_val trend_val_cumulative
0 1 2012 NaN NaN NaN
1 1 2013 down -1.0 -1.0
2 1 2014 up 1.0 0.0
3 1 2015 up 1.0 1.0
4 3 2011 NaN NaN NaN
5 3 2012 down -1.0 -1.0
6 3 2013 up 1.0 0.0
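If the first year of each customer should count as 0 rather than NaN, a compact variant might look like the sketch below. It assumes the frame is already sorted by group number and year, and the column name counter is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group number": [1, 1, 1, 1, 3, 3, 3],
    "year": ["2012", "2013", "2014", "2015", "2011", "2012", "2013"],
    "trend": [np.nan, "down", "up", "up", np.nan, "down", "up"],
})

# Map each trend label to a step; a missing trend (first year) becomes a step of 0.
step = df["trend"].map({"up": 1, "down": -1}).fillna(0)

# Running total of the steps, restarted for every customer.
df["counter"] = step.groupby(df["group number"]).cumsum().astype(int)
print(df)
```

No explicit loop over customers or years is needed, because groupby restarts the cumulative sum for each group.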

Python function definition on two list

Year Month Year_month
2009 2 2009/2
2009 3 2009/3
2007 4 2007/3
2006 10 2006/10
Year_month
200902
200903
200704
200610
I would like to combine the year and month columns into the format as Year_month (i.e. replace the original one). How could I do it? The following approach seems not working in Python. Thanks.
def f(x, y)
    return x*100+y
for i in range(0,filename.shape[0]):
    filename['Year_month'][i] = f(filename['year'][i] ,filename['month'][i])

I think you can use zfill:
df['Year_month'] = df.Year.astype(str) + df.Month.astype(str).str.zfill(2)
print(df)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610

df = pd.read_clipboard()
Year Month Year_month
0 2009 2 2009/2
1 2009 3 2009/3
2 2007 4 2007/3
3 2006 10 2006/10
df['Year_month'] = df.apply(lambda row: str(row.Year)+str(row.Month).zfill(2), axis=1)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
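If an integer result is acceptable, the arithmetic from the question's own function f can be applied to whole columns at once, without a loop. This is a sketch assuming the columns are named Year and Month and hold integers:

```python
import pandas as pd

df = pd.DataFrame({
    "Year":  [2009, 2009, 2007, 2006],
    "Month": [2, 3, 4, 10],
})

# Column arithmetic is element-wise: 2009 * 100 + 2 gives 200902.
df["Year_month"] = df["Year"] * 100 + df["Month"]
print(df)
```

The zfill answers above produce strings instead, which may be preferable if the column is later used as a label or key.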
