What is the best way to achieve this:
test = pd.DataFrame([2,3,4])
test1 = test.copy()
test2 = test.copy()
test1['start'] = 2017
test1['end'] = 2018
test2['start'] = 2018
test2['end'] = 2019
test = pd.concat([test1, test2])
With the following result:
0 start end
0 2 2017 2018
1 3 2017 2018
2 4 2017 2018
0 2 2018 2019
1 3 2018 2019
2 4 2018 2019
I think there must be a more elegant way ;)
Update (full picture):
DataFrame1 columns: id, year, value
DataFrame2 columns: start, end
result: id, start, end, avg of value for each id in DataFrame1 and each start/end combination of DataFrame2
data:
id year value
1 2016 -0,232
1 2017 -0,432
1 2018 -0,532
1 2019 -0,632
1 2020 -0,682
2 2016 0,768
2 2017 0,568
2 2018 0,468
2 2019 0,368
2 2020 0,318
2 2021 0,268
start end
2017 2018
2017 2019
2018 2019
result:
id start end avg_value
1 2017 2018 -0,48
1 2017 2019 -0,53
1 2018 2019 -0,58
2 2017 2018 0,52
2 2017 2019 0,47
2 2018 2019 0,42
The original question was how to build up the result dataframe (as a first step, without avg_value). It should then calculate the average of value over the years from start to end, both inclusive.
Use a cross join first and then a custom function:
# convert European decimal commas to floats
df1['value'] = df1['value'].replace(',', '.', regex=True).astype(float)

# mean of value for this id over the years between start and end (inclusive)
def f(x):
    return df1.loc[df1['year'].between(x['start'], x['end']) &
                   (df1['id'] == x['id']), 'value'].mean()

# cross join of unique ids with all start/end combinations via a helper column
df = (pd.merge(df1[['id']].drop_duplicates().assign(a=1), df2.assign(a=1), on='a')
        .drop('a', axis=1))
df['avg_value'] = df.apply(f, axis=1)
print(df)
id start end avg_value
0 1 2017 2018 -0.482
1 1 2017 2019 -0.532
2 1 2018 2019 -0.582
3 2 2017 2018 0.518
4 2 2017 2019 0.468
5 2 2018 2019 0.418
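In pandas 1.2+ the helper column a isn't needed, because merge supports cross joins directly. A minimal sketch of the same cross join, assuming df1, df2 and f from above:
# cross join of unique ids with every start/end combination (pandas >= 1.2)
df = df1[['id']].drop_duplicates().merge(df2, how='cross')
df['avg_value'] = df.apply(f, axis=1)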
Related
I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing values for 2007, 2009 and 2015; and for ID '3' I am missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills the years for one ID, and I was struggling to find a way to loop through each ID, adding a value for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
Answering my own question :). I needed to apply a lambda function after doing the groupby('Org') that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
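An alternative that skips the datetime conversion entirely is to reindex against the full ID × Year grid. A minimal sketch, assuming the question's integer ID/Year/Value columns and the years 2007-2015 from the desired output, with missing rows getting Value 1:
import pandas as pd

# every combination of existing IDs with the years 2007-2015
full = pd.MultiIndex.from_product([DF['ID'].unique(), range(2007, 2016)],
                                  names=['ID', 'Year'])

# rows absent from the original frame are created with Value = 1
DF_full = DF.set_index(['ID', 'Year']).reindex(full, fill_value=1).reset_index()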
I have a data frame like
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year',values='X') but the answer is not as expected
Try passing index in pivot():
out = df.pivot(index='Date', columns='Year', values='X')
# If needed, drop the axis names:
out = out.rename_axis(index=None, columns=None)
OR
Try via agg() and dropna():
# sort each column so non-null values come first, then drop all-NaN rows
out = df.pivot(columns='Year', values='X').agg(sorted, key=pd.isnull).dropna(how='all')
# If needed:
out.columns.names = [None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
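If Date can't serve as the pivot index (for example when days repeat or are missing across years), a common workaround is to number the rows within each year with cumcount and pivot on that counter. A sketch of that variant:
# build an explicit row counter per year, then pivot on it
out = (df.assign(row=df.groupby('Year').cumcount())
         .pivot(index='row', columns='Year', values='X')
         .rename_axis(index=None, columns=None))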
I have a data frame where I need to identify entries which are repeating from previous years.
Input:
df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst2','cst1','cst2','cst3','cst2','cst1','cst2','cst4','cst5','cst3'],
'year': [2017, 2017, 2017,2018,2018,2018,2018,2019,2019,2019,2019,2020]})
type year
0 cst1 2017
1 cst1 2017
2 cst2 2017
3 cst1 2018
4 cst2 2018
5 cst3 2018
6 cst2 2018
7 cst1 2019
8 cst2 2019
9 cst4 2019
10 cst5 2019
11 cst3 2020
From the above data frame, compare the types year-wise and identify entries which are not new. For example: since 2017 is the starting year, all its entries are considered new. When identifying duplicates in 2018, compare against all entries of 2017, so cst1 and cst2 are duplicates. 2019 should be compared against all entries of 2018 and 2017 to identify its duplicates, and so on.
output:
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst3 2018 0
6 cst2 2018 1
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
In the output, cst3 in 2020 is identified as a duplicate even though 2019 doesn't contain type cst3: when comparing each successive year, all preceding years need to be considered. Here 2018 has type cst3, so the 2020 entry is identified as a duplicate and labeled 1.
You can get the minimum year per type and then check whether each row of your data frame matches one of those minimums:
pd.merge(df1, df1.groupby("type").min().reset_index(), how="outer", indicator="status")\
  .replace({"status": {"both": 0, "left_only": 1}})
Output
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst2 2018 1
6 cst3 2018 0
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
DOCS
pandas.DataFrame.groupby
pandas.merge
pandas.DataFrame.replace
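The same status column can also be computed without a merge, by comparing each row's year against the first year of its type via transform; a minimal sketch:
# 1 for every row whose type already appeared in an earlier year
df1['status'] = (df1['year'] > df1.groupby('type')['year'].transform('min')).astype(int)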
I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However my B_previous_year is getting full of NaN
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then shift B down one row, keeping the result only where the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
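A self-merge is another option that doesn't depend on the rows being consecutive or sorted: shift the year column forward by one and join B back onto it. A sketch, assuming the original df from the question:
# B of year y becomes B_previous_year of year y + 1
prev = df[['year', 'B']].assign(year=df['year'] + 1).rename(columns={'B': 'B_previous_year'})
out = df.merge(prev, on='year', how='left')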
Lets assume I have the following three dataframes:
Dataframe 1:
df1 = {'year': ['2010','2012','2014','2015'], 'count': [1,1,1,1]}
df1 = pd.DataFrame(data=df1)
df1 = df1.set_index('year')
df1
year count
2010 1
2012 1
2014 1
2015 1
Dataframe 2:
df2 = {'year': ['2010','2011','2016','2017'], 'count': [2,1,3,1]}
df2 = pd.DataFrame(data=df2)
df2 = df2.set_index('year')
df2
year count
2010 2
2011 1
2016 3
2017 1
Dataframe 3:
df3 = {'year': ['2010','2011','2012','2013','2014','2015','2017'], 'count': [4,2,5,4,4,1,1]}
df3 = pd.DataFrame(data=df3)
df3 = df3.set_index('year')
df3
year count
2010 4
2011 2
2012 5
2013 4
2014 4
2015 1
2017 1
Now I want all three dataframes to contain all the years with their counts. For example, df1 is missing the years 2011, 2013, 2016 and 2017, so these should be added to its index with a count of 0.
So my output would be something like this for df1:
year count
2010 1
2012 1
2014 1
2015 1
2011 0
2013 0
2016 0
2017 0
And similarly for df2 and df3 as well. Thanks.
You can use union with reindex:
idx = df1.index.union(df2.index).union(df3.index)
print (idx)
Index(['2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017'], dtype='object', name='year')
Another solution:
from functools import reduce
import numpy as np

idx = reduce(np.union1d, [df1.index, df2.index, df3.index])
print (idx)
['2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017']
df1 = df1.reindex(idx, fill_value=0)
print (df1)
count
year
2010 1
2011 0
2012 1
2013 0
2014 1
2015 1
2016 0
2017 0
df2 = df2.reindex(idx, fill_value=0)
print (df2)
count
year
2010 2
2011 1
2012 0
2013 0
2014 0
2015 0
2016 3
2017 1
df3 = df3.reindex(idx, fill_value=0)
print (df3)
count
year
2010 4
2011 2
2012 5
2013 4
2014 4
2015 1
2016 0
2017 1
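Since the same reindex is applied to all three frames, it can also be done in one pass; a small sketch:
df1, df2, df3 = (df.reindex(idx, fill_value=0) for df in (df1, df2, df3))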
Use reindex on all_years like below (note: in recent pandas versions, use Index.union instead of the | operator):
In [257]: all_years = df1.index | df2.index | df3.index
In [258]: df1.reindex(all_years, fill_value=0)
Out[258]:
count
year
2010 1
2011 0
2012 1
2013 0
2014 1
2015 1
2016 0
2017 0
In [259]: df2.reindex(all_years, fill_value=0)
Out[259]:
count
year
2010 2
2011 1
2012 0
2013 0
2014 0
2015 0
2016 3
2017 1
I would go with union, but you can also use unique, i.e.
idx = pd.Series(np.concatenate([df1.index,df2.index,df3.index])).unique()
# or idx = set(np.concatenate([df1.index,df2.index,df3.index]))
df1.reindex(idx).fillna(0)
count
year
2010 1.0
2012 1.0
2014 1.0
2015 1.0
2011 0.0
2016 0.0
2017 0.0
2013 0.0
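Note that fillna after reindex casts count to float, as the output above shows; passing fill_value to reindex directly keeps the integer dtype:
df1.reindex(idx, fill_value=0)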
One can also use iteration:
# find missing years: all years 2010-2017 as strings not already in the index
morelist = [j for j in map(str, range(2010, 2018))
            if j not in df1.index]

# create a dataframe to be added:
df2add = pd.DataFrame(data=[0] * len(morelist),
                      columns=['count'],
                      index=morelist)

# add new dataframe to original:
df1 = pd.concat([df1, df2add])
print(df1)
Output:
count
2010 1
2012 1
2014 1
2015 1
2011 0
2013 0
2016 0
2017 0
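Because concat simply appends the new rows, the index ends up out of order; if the original ordering matters, a final sort restores it:
df1 = df1.sort_index()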