Pandas: changing years based on an int value - python

I'm trying to add years to one column based on a number in another column.
This is what I mean:
base_date amount_years
0 2006-09-01 2
1 2007-04-01 4
The result would be:
base_date amount_years
0 2008-09-01 2
1 2011-04-01 4
Is there a way to achieve this in Python?

Use DateOffset with apply and axis=1 to process the frame row by row:
f = lambda x: x['base_date'] + pd.offsets.DateOffset(years=x['amount_years'])
df['base_date'] = df.apply(f, axis=1)
print (df)
base_date amount_years
0 2008-09-01 2
1 2011-04-01 4
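For reference, here is a self-contained sketch of the same row-wise approach, rebuilt from the sample data in the question. The helper name add_years is only illustrative, and the int() cast is a defensive assumption so that DateOffset receives a plain Python integer.

```python
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({
    "base_date": pd.to_datetime(["2006-09-01", "2007-04-01"]),
    "amount_years": [2, 4],
})

def add_years(row):
    # Calendar-aware shift: month and day are kept, only the year changes
    return row["base_date"] + pd.DateOffset(years=int(row["amount_years"]))

df["base_date"] = df.apply(add_years, axis=1)
print(df)
```

Because apply with axis=1 calls the function once per row, this is convenient rather than fast; on large frames it may be noticeably slower than a vectorized approach.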

Related

Get sum of values month, year wise using Pandas

Below is the data that I have:
There are three columns: the class, the date and the marks.
I need my output to be in the format below:
Here the column headers 1, 2, 3 are the classes, and each column contains the total marks per month and year.
My approach used the following logic.
(Although I get the result, I am not satisfied with the code. I was hoping that someone could help me with a more efficient solution.)
# "class" is a reserved word in Python, so the variable is named "classes"
classes = sorted(df['Class'].unique())
dict_all = dict.fromkeys(classes, 1)
for c in classes:
    actuals = []
    for i in range(2018, 2019):
        for j in range(1, 13):
            a = df['date'].map(lambda x: x.year == i)
            b = df['date'].map(lambda x: x.month == j)
            x = df[a & b & (df['Class'] == c)].marks.sum()
            actuals.append(x)
    dict_all[c] = actuals
result = pd.DataFrame.from_dict(dict_all)
Input:
Class Date Total
0 3 01-01-2018 32
1 2 01-01-2018 69
2 1 01-01-2018 129
3 3 01-01-2019 12
Using df.pivot
df1 = df.pivot(index="Date",columns="Class", values="Total", ).reset_index().fillna(0)
print(df1)
Using crosstab
df1 = pd.crosstab(index=df["Date"],columns=df["Class"], values=df["Total"],aggfunc="max").reset_index().fillna(0)
print(df1)
Using groupby & unstack()
df1 = df.groupby(['Date','Class'])['Total'].max().unstack(fill_value=0).reset_index()
print(df1)
All three approaches give the same output:
Class Date 1 2 3
0 01-01-2018 129 69 32
1 01-01-2019 0 0 12
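The snippets above take the maximum per exact Date value, which matches a sum only while each Date/Class pair occurs once. If the goal is a sum per month and year, as the question title suggests, one possible sketch is to group by a monthly period. It assumes the dates are day-first (the source does not say), and the last row of the sample is a hypothetical addition so that one month holds two entries for the same class.

```python
import pandas as pd

# Sample input from the question; the last row is hypothetical
df = pd.DataFrame({
    "Class": [3, 2, 1, 3, 3],
    "Date": ["01-01-2018", "01-01-2018", "01-01-2018", "01-01-2019", "15-01-2019"],
    "Total": [32, 69, 129, 12, 5],
})

# Assumption: the date strings are day-month-year
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["Month"] = df["Date"].dt.to_period("M")

# Sum the totals per month and class, then spread the classes into columns
monthly = df.groupby(["Month", "Class"])["Total"].sum().unstack(fill_value=0)
print(monthly)
```

Grouping on a monthly Period instead of the raw date merges all days of a month into one row, which is what separates this from the pivot on Date.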

Create an index for each month in an array for a specific time interval

I have the beginning and end of an interval and want to create an index for each month.
I'm using pandas, but with the following approach I have to calculate the number of months myself:
import pandas as pd
pd.period_range('2014-04', periods=<number-of-month>, freq='M')
Is there any way to create it automatically? For example, I would pass the beginning and end of the interval as two arguments and get an index for each month. In other words:
pseudo-code:
pd.period_range(start='2014-04', end='2014-07', freq='M')
The expected output for the above pseudo-code is [0, 0, 0], because there are 3 months from 2014-04 to 2014-07.
The expected DataFrame, whose rows I want to access by index:
index date count
0 2014-04 0
1 2014-05 0
2 2014-06 0
Initially the array holds zero for every index; I call this column count. I want to increment the count column by date, for example:
a = pd.period_range(start='2014-04', end='2014-07', freq='M')
a['2014-04'] += 1
index date count
0 2014-04 1
1 2014-05 0
2 2014-06 0
How can I implement it?
You need to create a PeriodIndex with period_range, then use loc to add 1 to the count column. Note that period_range includes the end period, so 2014-07 is part of the index:
a = pd.period_range(start='2014-04', end='2014-07', freq='M')
df = pd.DataFrame({'count':0}, index=a)
df.loc['2014-04', 'count'] += 1
print (df)
count
2014-04 1
2014-05 0
2014-06 0
2014-07 0
Solution with Series:
a = pd.period_range(start='2014-04', end='2014-07', freq='M')
s = pd.Series(0, index=a)
s['2014-04'] += 1
print (s)
2014-04 1
2014-05 0
2014-06 0
2014-07 0
Freq: M, dtype: int64
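IIUC, make a pandas.Series with index = pd.date_range(...). With freq="M" the index holds month-end timestamps rather than periods (in pandas 2.2 and later the month-end alias for date_range is "ME", and "M" is deprecated there):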
import pandas as pd
s = pd.Series(0, index=pd.date_range(start='2014-04', end='2019-08', freq="M"))
s['2014-04'] += 1
s.head()
Output:
2014-04-30 1
2014-05-31 0
2014-06-30 0
2014-07-31 0
2014-08-31 0
Freq: M, dtype: int64
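If the increments come from a list of event dates rather than one manual update, the counting can be done in one step. The sketch below uses the PeriodIndex variant; the events values are hypothetical and only serve as illustration.

```python
import pandas as pd

# One period per month; both endpoints are included
months = pd.period_range(start="2014-04", end="2014-07", freq="M")

# Hypothetical event dates to be counted per month
events = pd.to_datetime(["2014-04-03", "2014-04-20", "2014-06-01"])

# Map each event to its month, count the occurrences,
# then align to the full range so that empty months get 0
counts = events.to_period("M").value_counts().reindex(months, fill_value=0)
print(counts)
```

The reindex step both orders the result chronologically and fills in the months that have no events.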

Accessing different columns from DataFrame in transform

I want to write a transformation function accessing two columns from a DataFrame and pass it to transform().
Here is the DataFrame which I would like to modify:
print(df)
date increment
0 2012-06-01 0
1 2003-04-08 1
2 2009-04-22 3
3 2018-05-24 6
4 2006-09-25 2
5 2012-11-02 4
I would like to increment the year in column date by the number of years given in column increment. The proposed code (which does not work) is:
df.transform(lambda df: date(df.date.year + df.increment, 1, 1))
Is there a way to access individual columns in the function (here a lambda function) passed to transform()?
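You can use pandas.to_timedelta, but note that unit='Y' treats a year as a fixed average duration (365.2425 days), so the results drift away from the calendar date, as the output below shows. Newer pandas versions no longer accept unit='Y':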
# If necessary convert to date type first
# df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'] + pd.to_timedelta(df['increment'], unit='Y')
[out]
date increment
0 2012-06-01 00:00:00 0
1 2004-04-07 05:49:12 1
2 2012-04-21 17:27:36 3
3 2024-05-23 10:55:12 6
4 2008-09-24 11:38:24 2
5 2016-11-01 23:16:48 4
Alternatively, rebuild the date from its components, which keeps the month and day exact:
df['date'] = pd.to_datetime({'year': df.date.dt.year.add(df.increment),
'month': df.date.dt.month,
'day': df.date.dt.day})
[out]
date increment
0 2012-06-01 0
1 2004-04-08 1
2 2012-04-22 3
3 2024-05-24 6
4 2008-09-25 2
5 2016-11-02 4
Your own solution could also be fixed by using the apply method with the axis=1 argument instead (note that it resets the month and day to January 1):
from datetime import date
df.apply(lambda df: date(df.date.year + df.increment, 1, 1), axis=1)
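As a quick check of the component-based variant, here is a self-contained sketch on the first three rows of the sample frame. The name shifted is illustrative only. One caveat worth keeping in mind: a February 29 shifted into a non-leap year has no valid calendar date, so this construction is expected to fail for such rows unless they are handled separately.

```python
import pandas as pd

# First three rows of the sample frame from the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-06-01", "2003-04-08", "2009-04-22"]),
    "increment": [0, 1, 3],
})

# Reassemble each date from its parts, adding the increment to the year only
shifted = pd.to_datetime({
    "year": df["date"].dt.year + df["increment"],
    "month": df["date"].dt.month,
    "day": df["date"].dt.day,
})
print(shifted)
```

This stays vectorized, so it avoids the per-row Python call that apply with axis=1 makes.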

Better way to create a new column instead of a for loop

Is there any faster way to do what this code does?
I just want to calculate t_last - t_i and create a new column:
time_ges = pd.DataFrame()
for i in range(0, len(df.GesamteMessung_Sec.index), 1):
    time = df.GesamteMessung_Sec.iloc[-1]-df.GesamteMessung_Sec.iloc[i]
    time_ges = time_ges.append(pd.DataFrame({'echte_Ladezeit': time}, index=[0]), ignore_index=True)
df['echte_Ladezeit'] = time_ges
This code takes a lot of computation time. Is there a better way to do this?
Thanks, R
You can subtract the column GesamteMessung_Sec from its last value, then use to_frame to convert the resulting Series to a DataFrame:
df = pd.DataFrame({'GesamteMessung_Sec':[10,2,1,5]})
print (df)
GesamteMessung_Sec
0 10
1 2
2 1
3 5
time_ges = (df.GesamteMessung_Sec.iloc[-1] - df.GesamteMessung_Sec).to_frame('echte_Ladezeit')
print (time_ges )
echte_Ladezeit
0 -5
1 3
2 4
3 0
If you need a new column in the original DataFrame:
df = pd.DataFrame({'GesamteMessung_Sec':[10,2,1,5]})
df['echte_Ladezeit'] = df.GesamteMessung_Sec.iloc[-1] - df.GesamteMessung_Sec
print (df)
GesamteMessung_Sec echte_Ladezeit
0 10 -5
1 2 3
2 1 4
3 5 0
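To see that the vectorized expression computes the same values as the original loop, here is a small sketch that compares both on the sample data. The names last, looped and vectorized are illustrative; the loop is written as a list comprehension because DataFrame.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd

df = pd.DataFrame({"GesamteMessung_Sec": [10, 2, 1, 5]})

last = df["GesamteMessung_Sec"].iloc[-1]

# Element-by-element version, equivalent to the loop in the question
looped = [last - t for t in df["GesamteMessung_Sec"]]

# Vectorized version: one subtraction over the whole column
vectorized = last - df["GesamteMessung_Sec"]

print(looped)
print(vectorized.tolist())
```

The vectorized form avoids building a new DataFrame on every iteration, which is where most of the time in the original loop is likely spent.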

How to efficiently add rows for those data points which are missing from a sequence using pandas?

I have the following time series dataset of the number of sales per day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now I want to include the missing data points (i.e. the missing dates) with a constant value (zero), so that the data looks as follows. How can I do this efficiently with pandas, assuming the data frame is ~50 MB?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161230,0**
20161231,8
** Missing rows which have been added to the data frame.
Any help will be appreciated.
First convert the date column with to_datetime, then call set_index and reindex over the range from the minimum to the maximum index value, followed by reset_index. If necessary, change the format afterwards with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
# The parentheses allow the method chain to span several lines
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Finally, if you need to change the format back:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
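A shorter variant of the same idea is possible with DataFrame.asfreq, which builds the daily range between the first and last index value itself. This is a sketch under the assumption that the date column holds integers like 20161224, as in the question; the name filled is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": [20161224, 20161225, 20161227, 20161231],
    "sales": [5, 2, 4, 8],
})

# Parse the integer dates; going through str keeps the format unambiguous
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# Conform the index to daily frequency, filling the added rows with 0
filled = df.set_index("date").asfreq("D", fill_value=0).reset_index()
print(filled)
```

Note that fill_value only applies to rows added by the upsampling; missing values already present in the data are left as they are.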
