Let's assume I have the following three dataframes:
Dataframe 1:
df1 = {'year': ['2010','2012','2014','2015'], 'count': [1,1,1,1]}
df1 = pd.DataFrame(data=df1)
df1 = df1.set_index('year')
df1
      count
year
2010      1
2012      1
2014      1
2015      1
Dataframe 2:
df2 = {'year': ['2010','2011','2016','2017'], 'count': [2,1,3,1]}
df2 = pd.DataFrame(data=df2)
df2 = df2.set_index('year')
df2
      count
year
2010      2
2011      1
2016      3
2017      1
Dataframe 3:
df3 = {'year': ['2010','2011','2012','2013','2014','2015','2017'], 'count': [4,2,5,4,4,1,1]}
df3 = pd.DataFrame(data=df3)
df3 = df3.set_index('year')
df3
      count
year
2010      4
2011      2
2012      5
2013      4
2014      4
2015      1
2017      1
Now I want all three dataframes to cover the full set of years. For example, df1 is missing the years 2011, 2013, 2016 and 2017, so these should be added to its index with a count of 0 against each newly added year.
So my output would be something like this for df1:
year count
2010 1
2012 1
2014 1
2015 1
2011 0
2013 0
2016 0
2017 0
And similarly for df2 and df3. Thanks.
You can use union with reindex:
idx = df1.index.union(df2.index).union(df3.index)
print (idx)
Index(['2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017'], dtype='object', name='year')
Another solution with functools.reduce and numpy.union1d:
import numpy as np
from functools import reduce

idx = reduce(np.union1d, [df1.index, df2.index, df3.index])
print (idx)
['2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017']
df1 = df1.reindex(idx, fill_value=0)
print (df1)
count
year
2010 1
2011 0
2012 1
2013 0
2014 1
2015 1
2016 0
2017 0
df2 = df2.reindex(idx, fill_value=0)
print (df2)
count
year
2010 2
2011 1
2012 0
2013 0
2014 0
2015 0
2016 3
2017 1
df3 = df3.reindex(idx, fill_value=0)
print (df3)
count
year
2010 4
2011 2
2012 5
2013 4
2014 4
2015 1
2016 0
2017 1
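If you have more than a few dataframes, the same union-and-reindex pattern can be applied in a loop. A minimal sketch, assuming the three dataframes above:
dfs = [df1, df2, df3]

# build the union of all indexes, then reindex every frame against it
idx = dfs[0].index
for d in dfs[1:]:
    idx = idx.union(d.index)

df1, df2, df3 = (d.reindex(idx, fill_value=0) for d in dfs)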
Use reindex with all_years:
In [257]: all_years = df1.index | df2.index | df3.index
In [258]: df1.reindex(all_years, fill_value=0)
Out[258]:
count
year
2010 1
2011 0
2012 1
2013 0
2014 1
2015 1
2016 0
2017 0
In [259]: df2.reindex(all_years, fill_value=0)
Out[259]:
count
year
2010 2
2011 1
2012 0
2013 0
2014 0
2015 0
2016 3
2017 1
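One caveat: in newer pandas versions the set-operation behaviour of | on Index objects is deprecated, so the union in the first line is better spelled explicitly:
all_years = df1.index.union(df2.index).union(df3.index)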
I would go with union, but you can also use unique, i.e.:
import numpy as np

idx = pd.Series(np.concatenate([df1.index, df2.index, df3.index])).unique()
# or: idx = set(np.concatenate([df1.index, df2.index, df3.index]))
df1.reindex(idx).fillna(0)
count
year
2010 1.0
2012 1.0
2014 1.0
2015 1.0
2011 0.0
2016 0.0
2017 0.0
2013 0.0
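Note that reindex without fill_value introduces NaN, which is why fillna is needed and the counts above come back as floats. Passing fill_value directly keeps the integer dtype:
df1.reindex(idx, fill_value=0)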
One can also use iteration:
# find missing years: all years 2010-2017 as strings that are not yet in the index
morelist = [j for j in map(str, range(2010, 2018))
            if j not in df1.index]

# create a dataframe of zero counts to be appended:
df2add = pd.DataFrame(data=[0]*len(morelist),
                      columns=['count'],
                      index=morelist)

# append the new dataframe to the original:
df1 = pd.concat([df1, df2add])
print(df1)
Output:
count
2010 1
2012 1
2014 1
2015 1
2011 0
2013 0
2016 0
2017 0
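The appended years end up at the bottom, as in the output above; chaining a sort_index puts them back in chronological order:
df1 = pd.concat([df1, df2add]).sort_index()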
Related
I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing 2007, 2009 and 2015; and for ID '3' I am missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills the gaps for one ID, and I was struggling to find a way to loop through each ID, adding a 'value' for each missing year:
idx = pd.date_range('2007', '2020', freq='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column only between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y
# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
Answering my own question :). I needed to apply a lambda function after the groupby('Org') that adds a NaN for each missing year. The reset_index effectively ungroups it back into the original flat list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
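For reference, the same per-ID gap filling can be written compactly with groupby plus reindex and fill_value. A small sketch assuming plain integer years rather than a DatetimeIndex (the sample frame here is hypothetical, mirroring the question's shape):
import pandas as pd

# hypothetical sample: two IDs, some years missing
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Year': [2007, 2008, 2011, 2008, 2010],
                   'Value': [1, 1, 1, 1, 1]})

years = pd.Index(range(2007, 2016), name='Year')  # 2007..2015 inclusive
out = (df.set_index('Year')
         .groupby('ID')['Value']
         .apply(lambda s: s.reindex(years, fill_value=1))  # fill gaps with 1
         .reset_index())
print(out)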
Working in Jupyter, my dataframe has the number of transactions per customer per year, plus a field that indicates the "trend": "up" for more transactions than the previous year, "down" for fewer, and null for the first year.
I want to create a numerator that, for every "up" per customer, is raised by 1 and for every "down" is reduced by 1.
I understand that I first need to sort the df and then build a loop over the customers, with an inner loop over the years, but I need help.
DF SAMPLE:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group number': [1, 1, 1, 1, 3, 3, 3],
    'year': ['2012', '2013', '2014', '2015', '2011', '2012', '2013'],
    'trend': [np.nan, 'down', 'up', 'up', np.nan, 'down', 'up']
})
This is what I have done so far:
df = pd.read_excel('totals_new.xlsx', sheet_name='Sheet1').sort_values(['group number', 'year'])
noofgroups = len(df['group number'].unique())
yearspergroup = df.groupby('group number')['year'].nunique()
vtrend = 0

for i in noofgroups:
    for j in yearspergroup:
        if df["trend"] == "up":
            vtrend = vtrend + 1
        if df["trend"] == "down":
            vtrend = vtrend - 1
IIUC, you can use nested np.where() to convert your trend column and then perform a groupby() and agg(). Take this sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'group number': [1,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,2,2,1,2,1,2],
'year': ['2017','2016','2018','2017','2016','2018','2017','2016','2018','2017','2016','2018',
'2017','2016','2018','2017','2016','2018','2017','2016','2018','2017'],
'trend': ['up','down','up',np.nan,'up','down',np.nan,'up','up','up','down',
'up',np.nan,'up','up','up','down','up','up','up',np.nan,'down']
})
Yields:
group number year trend
0 1 2017 up
1 1 2016 down
2 1 2018 up
3 1 2017 NaN
4 1 2016 up
5 1 2018 down
6 1 2017 NaN
7 2 2016 up
8 2 2018 up
9 2 2017 up
10 2 2016 down
11 2 2018 up
12 2 2017 NaN
13 1 2016 up
14 1 2018 up
15 1 2017 up
16 2 2016 down
17 2 2018 up
18 1 2017 up
19 2 2016 up
20 1 2018 NaN
21 2 2017 down
Then:
df['trend'] = np.where(df['trend']=='up', 1, np.where(df['trend']=='down', -1, 0))
df.groupby(['group number','year']).agg({'trend': 'sum'})
Returns:
trend
group number year
1 2016 1
2017 3
2018 1
2 2016 0
2017 0
2018 3
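If what you are after is the running numerator per customer across years rather than the per-year sum, you can chain a cumulative sum over the grouped result. A small sketch, continuing from the converted df above:
per_year = df.groupby(['group number', 'year'])['trend'].sum()
running = per_year.groupby(level='group number').cumsum()
print(running)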
This case is probably closed by now, but here's a possible solution, since the thread did not come to a conclusion previously.
import numpy as np
import pandas as pd

"""
In this case, the original dataframe is already properly sorted by group number and year.
If it isn't, those two columns should be sorted first.
"""
df = pd.DataFrame({
    'group number': [1, 1, 1, 1, 3, 3, 3],
    'year': ['2012', '2013', '2014', '2015', '2011', '2012', '2013'],
    'trend': [np.nan, 'down', 'up', 'up', np.nan, 'down', 'up']
})

# map 'down' to -1 and 'up' to 1, leaving the NaN rows untouched
df['trend_val'] = df.loc[df['trend'].notna(), 'trend'].map(lambda x: -1 if x == 'down' else 1)
df = df.join(df.groupby('group number')['trend_val'].cumsum(), rsuffix='_cumulative')
>>> df
group number year trend trend_val trend_val_cumulative
0 1 2012 NaN NaN NaN
1 1 2013 down -1.0 -1.0
2 1 2014 up 1.0 0.0
3 1 2015 up 1.0 1.0
4 3 2011 NaN NaN NaN
5 3 2012 down -1.0 -1.0
6 3 2013 up 1.0 0.0
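If you would rather have each customer start at 0 instead of NaN for its first year, fill the gaps after the cumulative sum, e.g.:
df['trend_val_cumulative'] = df.groupby('group number')['trend_val'].cumsum().fillna(0)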
What is the best way to achieve this:
test = pd.DataFrame([2,3,4])
test1 = test.copy()
test2 = test.copy()
test1['start'] = 2017
test1['end'] = 2018
test2['start'] = 2018
test2['end'] = 2019
test = pd.concat([test1, test2])
With the following result:
0 start end
0 2 2017 2018
1 3 2017 2018
2 4 2017 2018
0 2 2018 2019
1 3 2018 2019
2 4 2018 2019
I think there will be a more elegant way ;)
Update (full picture):
DataFrame1 columns: id, year, value
DataFrame2 columns: start, end
result: id, start, end, avg of value for each id in DataFrame1 and each start/end combination of DataFrame2
data:
id year value
1 2016 -0,232
1 2017 -0,432
1 2018 -0,532
1 2019 -0,632
1 2020 -0,682
2 2016 0,768
2 2017 0,568
2 2018 0,468
2 2019 0,368
2 2020 0,318
2 2021 0,268
start end
2017 2018
2017 2019
2018 2019
result:
id start end avg_value
1 2017 2018 -0,48
1 2017 2019 -0,53
1 2018 2019 -0,58
2 2017 2018 0,52
2 2017 2019 0,47
2 2018 2019 0,42
The original question was to build up the result dataframe (as a first step, without the avg_value). It should calculate the average over the years from start to end inclusive.
Use cross join first and then custom function:
df1['value'] = df1['value'].replace(',','.', regex=True).astype(float)
def f(x):
    return df1.loc[df1['year'].between(x['start'], x['end']) &
                   (df1['id'] == x['id']), 'value'].mean()

df = (pd.merge(df1[['id']].drop_duplicates().assign(a=1), df2.assign(a=1), on='a')
        .drop('a', axis=1))
df['avg_value'] = df.apply(f, axis=1)
print (df)
id start end avg_value
0 1 2017 2018 -0.482
1 1 2017 2019 -0.532
2 1 2018 2019 -0.582
3 2 2017 2018 0.518
4 2 2017 2019 0.468
5 2 2018 2019 0.418
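As a side note, newer pandas (1.2+) has a native cross join, which makes the helper column unnecessary. A sketch assuming the same df1, df2 and f as above:
df = pd.merge(df1[['id']].drop_duplicates(), df2, how='cross')
df['avg_value'] = df.apply(f, axis=1)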
The title might be a bit confusing; this is what I want to do:
I would like to convert this dataframe
pd.DataFrame({'name':['A','B','C'],'date1':[1999,2000,2001],'date2':[2011,2012,2013]})
date1 date2 name
0 1999 2011 A
1 2000 2012 B
2 2001 2013 C
Into the following:
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
I've been trying to do pivot tables and transposing, but with no luck.
You can use melt, remove the helper column with drop, and finally sort_values:
print (pd.melt(df, id_vars='name', value_name='dates')
.drop('variable', axis=1)
.sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Another solution with unstack and sort_index:
print (df.set_index('name')
.unstack()
.reset_index(drop=True, level=0)
.sort_index()
.reset_index(name='dates')[['dates','name']])
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
Solution with lreshape and sort_values:
print (pd.lreshape(df, {'dates':['date1', 'date2']}).sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Numpy solution with numpy.repeat and flattening by numpy.ravel:
df2 = pd.DataFrame({
"name": np.repeat(df.name, 2),
"dates": df[['date1','date2']].values.ravel()})
print (df2)
dates name
0 1999 A
0 2011 A
1 2000 B
1 2012 B
2 2001 C
2 2013 C
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (together with pd.wide_to_long).
A possible outcome is merging all three functions into one, maybe melt, but that is not implemented yet. Maybe in some new version of pandas; this answer will be updated then.
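Since the EDIT mentions pd.wide_to_long, here is a minimal sketch of that route as well, assuming the same df as above ('which' is just a hypothetical name for the 1/2 suffix column):
out = (pd.wide_to_long(df, stubnames='date', i='name', j='which')
         .reset_index()
         .rename(columns={'date': 'dates'})
         .sort_values('name')[['dates', 'name']])
print(out)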
Year Month Year_month
2009 2 2009/2
2009 3 2009/3
2007 4 2007/4
2006 10 2006/10
Year_month
200902
200903
200704
200610
I would like to combine the Year and Month columns into the Year_month format shown above (i.e. replace the original column). How could I do it? The following approach does not seem to work in Python. Thanks.
def f(x, y):
    return x*100 + y

for i in range(0, filename.shape[0]):
    filename['Year_month'][i] = f(filename['year'][i], filename['month'][i])
I think you can use zfill:
df['Year_month'] = df.Year.astype(str) + df.Month.astype(str).str.zfill(2)
print (df)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
df = pd.read_clipboard()
Year Month Year_month
0 2009 2 2009/2
1 2009 3 2009/3
2 2007 4 2007/4
3 2006 10 2006/10
df['Year_month'] = df.apply(lambda row: str(row.Year)+str(row.Month).zfill(2), axis=1)
Year Month Year_month
0 2009 2 200902
1 2009 3 200903
2 2007 4 200704
3 2006 10 200610
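For completeness, the asker's original x*100 + y idea also works vectorized, if an integer Year_month (rather than a string) is acceptable:
# vectorized form of f(x, y) = x*100 + y; yields integers like 200902
df['Year_month'] = df['Year']*100 + df['Month']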