I have a data frame like
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to reshape it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year', values='X') but the result is not what I expected.
Try passing index to pivot():
out = df.pivot(index='Date', columns='Year', values='X')
# If needed, drop the axis names:
out = out.rename_axis(index=None, columns=None)
OR
Try via agg() and dropna():
out = df.pivot(columns='Year', values='X').agg(sorted, key=pd.isnull).dropna(how='all')
# If needed, drop the column axis name:
out.columns.names = [None]
Output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
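An equivalent alternative is to number the rows within each year with groupby/cumcount and use that counter as the pivot index; a sketch, assuming the rows of each year are already in the desired order:
out = (df.assign(idx=df.groupby('Year').cumcount())
         .pivot(index='idx', columns='Year', values='X')
         .rename_axis(index=None, columns=None))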
Related
I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing 2007, 2009, and 2015; and for ID '3' I am missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills for one ID, and I was struggling to find a way to loop through each ID, adding a 'Value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (assuming there are other years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
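The same fill can be done without apply using a boolean mask; a sketch under the same 2007-2015 assumption:
# Rows whose Value is missing and whose Year falls in the window
mask = df1['Value'].isna() & df1['Year'].between(2007, 2015)
df1.loc[mask, 'Value'] = 0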
Answering my own question :). I needed to apply a lambda function after doing the groupby('Org') that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
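Since Year is an integer here, a reindex against a complete (ID, Year) MultiIndex avoids the datetime conversion entirely; a minimal sketch, assuming the columns are named ID, Year, and Value as in the question, with 2007-2015 as the target range:
# Cartesian product of all IDs and the full year range
full_idx = pd.MultiIndex.from_product(
    [DF['ID'].unique(), range(2007, 2016)], names=['ID', 'Year'])
# Missing (ID, Year) pairs are created and filled with 1
DF_full = (DF.set_index(['ID', 'Year'])
             .reindex(full_idx, fill_value=1)
             .reset_index())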
I have a data frame where I need to identify entries which are repeating from previous years.
Input:
df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst2', 'cst1', 'cst2', 'cst3', 'cst2', 'cst1', 'cst2', 'cst4', 'cst5', 'cst3'],
                    'year': [2017, 2017, 2017, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020]})
type year
0 cst1 2017
1 cst1 2017
2 cst2 2017
3 cst1 2018
4 cst2 2018
5 cst3 2018
6 cst2 2018
7 cst1 2019
8 cst2 2019
9 cst4 2019
10 cst5 2019
11 cst3 2020
From the above data frame, compare types year by year and identify entries that are not new.
For example, since 2017 is the starting year, all of its entries are considered new. When identifying duplicates in 2018, compare against all entries of 2017: cst1 and cst2 are duplicates. For 2019, compare against all entries of 2017 and 2018 combined, and so on.
output:
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst3 2018 0
6 cst2 2018 1
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
In the output, cst3 in 2020 is identified as a duplicate even though 2019 does not contain type cst3: when checking each successive year, all preceding years must be considered, and 2018 already has type cst3, so it is labeled 1.
You can get the minimum year per group and then flag the rows of your data frame that do not match a (type, minimum year) pair:
pd.merge(df1, df1.groupby("type").min().reset_index(), how="outer", indicator="status")\
  .replace({"status": {"both": 0, "left_only": 1}})
Output
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst2 2018 1
6 cst3 2018 0
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
DOCS
pandas.DataFrame.groupby
pandas.merge
pandas.DataFrame.replace
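A shorter alternative that keeps the original row order is to compare each row's year with the first year its type appeared; a sketch using groupby/transform:
# 1 when the type was already seen in an earlier year, else 0
df1['status'] = (df1['year'] > df1.groupby('type')['year'].transform('min')).astype(int)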
I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column is full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then shift B down one row, keeping the shifted value only where the gap to the previous row really is one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df.year.diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
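If the years may be unsorted or non-contiguous, a lookup by year - 1 also works; a sketch, assuming the year values are unique:
# Map each row's previous year onto B via a year-indexed lookup
df['B_previous_year'] = (df['year'] - 1).map(df.set_index('year')['B'])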
I've been given a dataset that stores dates as integers, e.g. 52019 for May 2019. I've put it into a Pandas DataFrame, and I need to split that value into a month column and a year column, but I can't figure out how to do that for an int64 datatype or how to handle two-digit months. So I want to take something like
ID Date
1 22019
2 32019
3 52019
5 102019
and make it become
ID Month Year
1 2 2019
2 3 2019
3 5 2019
5 10 2019
What should I do?
divmod
import numpy as np

df['Month'], df['Year'] = np.divmod(df.Date, 10000)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Without mutating the original dataframe, using assign:
df.assign(**dict(zip(['Month', 'Year'], np.divmod(df.Date, 10000))))
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Using // and %
df['Month'], df['Year'] = df.Date // 10000, df.Date % 10000
df
Out[528]:
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Use:
s = pd.to_datetime(df.pop('Date'), format='%m%Y')  # convert to datetime; pop deletes the column
df['Month'], df['Year'] = s.dt.month, s.dt.year  # extract month and year
print(df)
ID Month Year
0 1 2 2019
1 2 3 2019
2 3 5 2019
3 5 10 2019
str.extract can handle the tricky part of figuring out whether the Month has 1 or 2 digits.
(df['Date'].astype(str)
.str.extract(r'^(?P<Month>\d{1,2})(?P<Year>\d{4})$')
.astype(int))
Month Year
0 2 2019
1 3 2019
2 5 2019
3 10 2019
You may also use string slicing if it's guaranteed your numbers have only 5 or 6 digits (if not, use str.extract above):
u = df['Date'].astype(str)
df['Month'], df['Year'] = u.str[:-4], u.str[-4:]
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
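Note that slicing leaves Month and Year as string columns; if you need integers, cast them (a one-line sketch):
df['Month'], df['Year'] = u.str[:-4].astype(int), u.str[-4:].astype(int)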
What is the best way to achieve this:
test = pd.DataFrame([2,3,4])
test1 = test.copy()
test2 = test.copy()
test1['start'] = 2017
test1['end'] = 2018
test2['start'] = 2018
test2['end'] = 2019
test = pd.concat([test1, test2])
With the following result:
0 start end
0 2 2017 2018
1 3 2017 2018
2 4 2017 2018
0 2 2018 2019
1 3 2018 2019
2 4 2018 2019
I think there will be a more elegant way ;)
Update (full picture):
DataFrame1 columns: id, year, value
DataFrame2 columns: start, end
result: id, start, end, avg of value for each id in DataFrame1 and each start/end combination of DataFrame2
data:
id year value
1 2016 -0,232
1 2017 -0,432
1 2018 -0,532
1 2019 -0,632
1 2020 -0,682
2 2016 0,768
2 2017 0,568
2 2018 0,468
2 2019 0,368
2 2020 0,318
2 2021 0,268
start end
2017 2018
2017 2019
2018 2019
result:
id start end avg_value
1 2017 2018 -0,48
1 2017 2019 -0,53
1 2018 2019 -0,58
2 2017 2018 0,52
2 2017 2019 0,47
2 2018 2019 0,42
The original question was just about building up the result dataframe (as a first step, without avg_value). avg_value should then be the average of value over the years between start and end, inclusive.
Use a cross join first and then a custom function:
df1['value'] = df1['value'].replace(',', '.', regex=True).astype(float)

def f(x):
    return df1.loc[df1['year'].between(x['start'], x['end']) &
                   (df1['id'] == x['id']), 'value'].mean()

df = (pd.merge(df1[['id']].drop_duplicates().assign(a=1), df2.assign(a=1), on='a')
        .drop('a', axis=1))
df['avg_value'] = df.apply(f, axis=1)
print(df)
id start end avg_value
0 1 2017 2018 -0.482
1 1 2017 2019 -0.532
2 1 2018 2019 -0.582
3 2 2017 2018 0.518
4 2 2017 2019 0.468
5 2 2018 2019 0.418
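With pandas 1.2+, the helper column can be replaced by how='cross', and the row-wise apply by a vectorized groupby mean; a sketch under the same column names:
# All (id, start, end) combinations
pairs = df1[['id']].drop_duplicates().merge(df2, how='cross')
# Attach every (year, value) row of the matching id, keep in-range years
merged = pairs.merge(df1, on='id')
merged = merged[merged['year'].between(merged['start'], merged['end'])]
# Average value per combination
out = (merged.groupby(['id', 'start', 'end'], as_index=False)['value']
             .mean()
             .rename(columns={'value': 'avg_value'}))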