Loop Optimization in Python

I have a dataframe df like this
Product  Yr    Value
A        2014  1
A        2015  3
A        2016  2
B        2015  2
B        2016  1
I want to compute the cumulative maximum, i.e.:
Product  Yr    Value
A        2014  1
A        2015  3
A        2016  3
B        2015  2
B        2016  2
My actual data has about 50000 products
I am currently writing code like this:
df2 = pd.DataFrame()
for i in df['Product'].unique():
    data3 = df[df['Product'] == i]
    data3 = data3.sort_values(by=['Yr'])   # sort_values returns a copy, so assign it back
    data3['Value'] = data3['Value'].cummax()
    df2 = df2.append(data3)
#df2 is my result
This code is taking a lot of time (~3 days) for about 50,000 products and 10 years. Is there some way to speed it up?

You can use groupby.cummax instead:
df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()
df
#Product Yr Value
#0 A 2014 1
#1 A 2015 3
#2 A 2016 3
#3 B 2015 2
#4 B 2016 2
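For reference, a minimal self-contained sketch of this approach on the sample data from the question (column names as shown there):

import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Yr':      [2014, 2015, 2016, 2015, 2016],
    'Value':   [1, 3, 2, 2, 1],
})

# cummax within each product after sorting by year; the result keeps the
# original index, so it aligns straight back onto the 'Value' column
df['Value'] = df.sort_values('Yr').groupby('Product')['Value'].cummax()
print(df)

The per-product Python loop and the repeated df2.append calls (each of which builds a new copy of the accumulated frame, and which is deprecated in recent pandas) are what make the original approach so slow; the grouped cummax does the same work in one vectorized pass.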

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 in PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows so that I only keep ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Use a boolean mask to filter your dataframe: group by id and use pandas.core.groupby.GroupBy.transform to count how many rows each id has, then compare that count with the total number of distinct years.
from io import StringIO
import pandas as pd

s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')

# keep the ids whose row count equals the number of distinct years
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
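One caveat, added here rather than taken from the answer: 'count' counts rows, so an id with a duplicated year could pass the check while still missing another year. If duplicates are possible, a nunique-based mask is a safer variant:

# count distinct years per id instead of raw rows
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]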
Here is a way using pivot and dropna to automatically find the ids with no missing years:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
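Yet another option, not from the original answers, is to build an id-by-year presence table with pd.crosstab and keep only the ids that appear in every year:

import pandas as pd

# True for ids that have at least one row in every year present in the data
complete = pd.crosstab(df['id'], df['year']).gt(0).all(axis=1)
out = df[df['id'].map(complete)]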

Loop through timeseries and fill missing data - Python

I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, ID '1' is missing values for 2010 and 2012, ID '2' is missing values for 2007, 2009 and 2015, and ID '3' is missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills for one ID, and I was struggling to find a way to loop through each ID, adding a 'Value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (assuming that there are other years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
Answering my own question :). I needed to apply a lambda function after grouping by 'Org' that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
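For the exact output asked for (every ID covering every year, with the gaps filled with 1), a MultiIndex reindex is another sketch; column names and integer dtypes are assumed to match the question's table, and the year range is taken from the data itself:

import pandas as pd

# full grid of every ID x every year observed anywhere in the data
full_index = pd.MultiIndex.from_product(
    [DF['ID'].unique(), range(DF['Year'].min(), DF['Year'].max() + 1)],
    names=['ID', 'Year'])

DF_filled = (DF.set_index(['ID', 'Year'])
               .reindex(full_index, fill_value=1)   # missing years get Value = 1
               .reset_index())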

Getting sums for smoothly shifting 3-month groups of monthly data in Pandas

I have a time series data of the following form:
Item 2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 A 0 1 2 3 4 5
1 B 5 4 3 2 1 0
This is monthly data but I want to get quarterly data of this data. A normal quarterly data would be calculated by summing up Jan-Mar and Apr-Jun and would look like this:
Item 2020 Q1 2020 Q2
0 A 3 12
1 B 12 3
I want to get smoother quarterly data so it would shift by only 1 month for each new data item, not 3 months. So it would have Jan-Mar, then Feb-Apr, then Mar-May, and Apr-Jun. So the resulting data would look like this:
Item 2020 Q1 2020 Q1 2020 Q1 2020 Q2
0 A 3 6 9 12
1 B 12 9 6 3
I believe this is similar to cumsum which can be used as follows:
df_dates = df.iloc[:,1:]
df_dates.cumsum(axis=1)
which leads to the following result:
2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 0 1 3 6 10 15
1 5 9 12 14 15 15
but instead of summing over the whole period so far, I want it to sum only the nearest 3 months (a quarter).
I do not know what this version of cumsum is called, but I have seen it in many places, so I believe there might be a library function for it.
Let us solve this in steps:
Set the index to the Item column
Parse the date-like columns into quarterly period labels
Calculate the rolling sum with a window of size 3
Shift the rolling sum 2 units to the left along the columns axis and drop the last two columns: rolling labels each window by its last month, so the shift relabels the Jan-Mar sum onto the Jan column, and the last two columns (whose windows would run past Jun) become empty
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
s = s.rolling(3, axis=1).sum().shift(-2, axis=1).iloc[:, :-2]
print(s)
2020 Q1 2020 Q1 2020 Q1 2020 Q2
Item
A 3.0 6.0 9.0 12.0
B 12.0 9.0 6.0 3.0
Try a column-wise groupby with axis=1:
>>> df.iloc[:, [0]].join(df.iloc[:, 1:].groupby(pd.to_datetime(df.columns[1:], format='%Y %b').quarter, axis=1).sum().add_prefix('Q'))
Item Q1 Q2
0 A 3 12
1 B 12 3
>>>
Edit:
I misread your question; to do what you want, try a rolling sum:
>>> x = df.rolling(3, axis=1).sum().dropna(axis='columns')
>>> df.iloc[:, [0]].join(x.set_axis('Q' + pd.to_datetime(df.columns[1:], format='%Y %b').quarter.astype(str)[:len(x.T)], axis=1))
Item Q1 Q1 Q1 Q2
0 A 3.0 6.0 9.0 12.0
1 B 12.0 9.0 6.0 3.0
>>>
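A note added here (not from the answers): recent pandas versions deprecate axis=1 for both rolling and groupby, so a transpose-based variant of the rolling approach may age better. A sketch, assuming the same column layout as in the question:

s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M')

# roll over the transposed months (avoids the deprecated axis=1), then
# transpose back and label each 3-month window by its starting month's quarter
rolled = s.T.rolling(3).sum().dropna().T
rolled.columns = s.columns[:rolled.shape[1]].strftime('%Y Q%q')
print(rolled)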

pivot dataframe using columns and values

I have a data frame like:
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year', values='X') but the result is not as expected.
Try passing index in pivot():
out=df.pivot(columns='Year',values='X',index='Date')
#If needed use:
out=out.rename_axis(index=None,columns=None)
OR
Try via agg() and dropna():
out=df.pivot(columns='Year',values='X').agg(sorted,key=pd.isnull).dropna(how='all')
#If needed use:
out.columns.names=[None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
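If 'Date' could repeat within a year (which would make pivot raise on duplicate entries), a per-year row counter can serve as the index instead; a sketch of that variant, added here rather than taken from the answer:

out = (df.assign(row=df.groupby('Year').cumcount())
         .pivot(index='row', columns='Year', values='X')
         .rename_axis(index=None, columns=None))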

reshape a pandas DataFrame by transposing two columns and repeating another

The title might be a bit confusing; this is what I want to do:
I would like to convert this dataframe
pd.DataFrame({'name':['A','B','C'],'date1':[1999,2000,2001],'date2':[2011,2012,2013]})
date1 date2 name
0 1999 2011 A
1 2000 2012 B
2 2001 2013 C
Into the following:
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
I've been trying to do pivot tables and transposing, but with no luck.
You can use melt, remove the helper column with drop, and finally sort_values:
print (pd.melt(df, id_vars='name', value_name='dates')
         .drop('variable', axis=1)
         .sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Another solution with unstack and sort_index:
print (df.set_index('name')
         .unstack()
         .reset_index(drop=True, level=0)
         .sort_index()
         .reset_index(name='dates')[['dates','name']])
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
Solution with lreshape and sort_values:
print (pd.lreshape(df, {'dates':['date1', 'date2']}).sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Numpy solution with numpy.repeat and flattening by numpy.ravel:
import numpy as np

df2 = pd.DataFrame({
    "name": np.repeat(df.name, 2),
    "dates": df[['date1', 'date2']].values.ravel()})
print (df2)
dates name
0 1999 A
0 2011 A
1 2000 B
1 2012 B
2 2001 C
2 2013 C
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (together with pd.wide_to_long).
A possible outcome is merging all three functions into one, perhaps melt, but that is not implemented yet. Maybe in some future version of pandas; then this answer will be updated.
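Since the edit mentions pd.wide_to_long, here is a sketch of that route as well (added here, not part of the original answer), treating 'date1'/'date2' as the stub 'date':

out = (pd.wide_to_long(df, stubnames='date', i='name', j='num')
         .reset_index()
         .sort_values(['name', 'num'])[['date', 'name']]
         .rename(columns={'date': 'dates'})
         .reset_index(drop=True))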
