Creating a new Column "Week" from existing column of Date - python

I have a dataset with a Date column stored in a continuous numeric format. I would like to add a new column that extracts the week from the value in the Date column.
A B
1 20050121
2 20050111
3 20050205
4 20050101
Here the B column indicates the date in YEAR|MONTH|DAY format. I would like to add a new column to this dataset that takes the date from column B and tells us which week it belongs to, something like this:
A B C
1 20050121 3
2 20050111 2
3 20050205 5
4 20050101 1
The week count starts from the 1st of January 2005. I thought of splitting out the month and day values separately and then calculating the week from those two values. How can I do this?
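For reference, the split the asker describes can be done with plain integer arithmetic (a sketch; the answers below avoid it by parsing the column as datetimes instead):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [20050121, 20050111, 20050205, 20050101]})

# split the integer date into its parts with integer arithmetic
year = df['B'] // 10000
month = df['B'] // 100 % 100
day = df['B'] % 100
print (pd.concat([year, month, day], axis=1, keys=['year', 'month', 'day']))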

It seems you need strftime (see http://strftime.org/):
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.strftime('%W')
print (df)
A B C
0 1 20050121 03
1 2 20050111 02
2 3 20050205 05
3 4 20050101 00
If you need ints:
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.strftime('%W').astype(int)
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 0
If you use weekofyear, you get more than 50 for the first week:
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.weekofyear
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 53
But it is possible to mask it:
import numpy as np

dates = pd.to_datetime(df['B'], format='%Y%m%d')
m = (dates.dt.month == 1) & (dates.dt.weekofyear > 50)
df['C'] = np.where(m, 1, dates.dt.weekofyear)
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 1
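Since the only value that needs remapping is the 0 that '%W' produces for days before the first Monday, an equivalent shorter sketch is to clip the integer weeks at 1; this gives the same C column as the masked version above:
df['C'] = (pd.to_datetime(df['B'], format='%Y%m%d')
             .dt.strftime('%W').astype(int)
             .clip(lower=1))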

In general this will work, but there is some confusion about the beginning of the year:
import datetime
import pandas as pd

date_from_str = datetime.datetime.strptime
df = pd.DataFrame([[1, 20050121],
                   [2, 20050111],
                   [3, 20050205],
                   [4, 20050101]], columns=['A', 'B'])
df['C'] = df['B'].astype('str').apply(
    lambda date: date_from_str(date, '%Y%m%d').isocalendar()[1])
df
Output is:
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 53
To avoid this, another answer here suggests this ad-hoc fix:
from datetime import date, timedelta

def correct(date_):
    year, week = date_.year, date_.isocalendar()[1]
    # Monday of week `week`, reconstructed with the '%W' convention
    ret = date_from_str('%04d-%02d-1' % (year, week), '%Y-%W-%w')
    if date(year, 1, 4).isoweekday() > 4:
        ret -= timedelta(days=7)
    return ret.isocalendar()[1]

df['C'] = df['B'].astype('str').apply(lambda d: correct(date_from_str(d, '%Y%m%d')))
Then the output will be:
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 1
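For context, isocalendar() returns (ISO year, ISO week, ISO weekday), and ISO week 1 is the week containing the first Thursday of the year; that is why 2005-01-01 lands in week 53. A minimal check:
from datetime import date

# 2005-01-01 was a Saturday, so it belongs to the last ISO week of 2004:
# isocalendar() gives ISO year 2004, week 53, weekday 6
print (date(2005, 1, 1).isocalendar())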

Related

Groupby selected rows by a condition on a column value and then transform another column

This seems like it should be easy, but I couldn't find a working solution:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
                   'B': [1,1,2,2,3],
                   'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on the values of column A, then group by the values of column B, and finally transform the values of column C into a sum, something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split the dataframe and do it. I am looking more for a solution that does it without the need for split + concat/merge. Thank you.
Is it just:
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
           df[~s]))
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D','A','B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
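If you prefer to stay closer to the transform idea from the question, here is a sketch of the same result via transform plus drop_duplicates (assuming the duplicated (A, B) rows should collapse, as in the desired output):
import pandas as pd

df = pd.DataFrame({'A': [0,0,2,2,2],
                   'B': [1,1,2,2,3],
                   'C': [1,1,2,3,4]})

m = df['A'].eq(2)
# sum C within B-groups, but only for the selected rows
df.loc[m, 'C'] = df.loc[m].groupby('B')['C'].transform('sum')
# the selected rows are now duplicated per (A, B); keep one of each
out = pd.concat([df[~m], df[m].drop_duplicates(['A', 'B'])]).sort_index()
print (out)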

Python - Finding longest continuous run in Pandas Dataframe

I have a pandas dataframe which has the following variables: week, product code, constraint flag (0 or 1 indicating whether the product supply was constrained).
Week Product_Code Constraint_Flag
1 A 1
1 B 0
2 A 0
2 B 1
3 A 0
3 B 0
4 A 0
4 B 0
5 A 1
5 B 0
I want to find the longest time period for which the supply was unconstrained, i.e. the longest run of 0s for each product code. So for product A I would want to know that the longest run started in week 2 and lasted for 3 weeks, and for product B that the longest run started in week 3 and lasted for 3 weeks.
How can I do this?
Use this solution to find the longest all-zero stretch, and then filter and aggregate first and last:
import numpy as np

m = np.concatenate(([True], df['Constraint_Flag'] != 0, [True]))
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)
s, e = ss[(ss[:,1] - ss[:,0]).argmax()]
pos = df.columns.get_loc('Week')
print (s,e)
4 8
print (df.iloc[s:e])
Week Product_Code Constraint_Flag
4 3 A 0
5 3 B 0
6 4 A 0
7 4 B 0
df = df.iloc[s:e].groupby('Product_Code')['Week'].agg(['first','last'])
print (df)
first last
Product_Code
A 3 4
B 3 4
But if you need to compare per group:
def f(x):
    print (x)
    m = np.concatenate(([True], x['Constraint_Flag'] != 0, [True]))
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)
    s, e = ss[(ss[:,1] - ss[:,0]).argmax()]
    pos = x.columns.get_loc('Week')
    c = ['start','end']
    return pd.Series([x.iat[s, pos], x.iat[e-1, pos]], index=c)
Week Product_Code Constraint_Flag
0 1 A 1
2 2 A 0
4 3 A 0
6 4 A 0
8 5 A 1
Week Product_Code Constraint_Flag
1 1 B 0
3 2 B 1
5 3 B 0
7 4 B 0
9 5 B 0
df = df.groupby('Product_Code').apply(f)
print (df)
start end
Product_Code
A 2 4
B 3 5
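For comparison, here is a run-length labelling sketch with shift/cumsum (column names as in the question) that reports the start week and length of the longest run of 0s per product:
import pandas as pd

df = pd.DataFrame({'Week': [1,1,2,2,3,3,4,4,5,5],
                   'Product_Code': ['A','B','A','B','A','B','A','B','A','B'],
                   'Constraint_Flag': [1,0,0,1,0,0,0,0,1,0]})

def longest_zero_run(g):
    g = g.sort_values('Week')
    # label consecutive runs: the label increments whenever the flag changes
    run_id = g['Constraint_Flag'].ne(g['Constraint_Flag'].shift()).cumsum()
    zeros = g.assign(run=run_id).loc[g['Constraint_Flag'].eq(0)]
    best = zeros['run'].value_counts().idxmax()   # run label with the most zeros
    weeks = zeros.loc[zeros['run'] == best, 'Week']
    return pd.Series({'start': weeks.iloc[0], 'length': len(weeks)})

print (df.groupby('Product_Code').apply(longest_zero_run))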

Pandas convert string to end of month date

I have a problem where one of the columns in my df is stored as a string, but I want to convert it to an end-of-month date in Python. For example,
Id Name Date Number
0 1 A 201601 5
1 2 B 201602 6
2 3 C 201603 4
The Date column has the year and month as string. Ideally, my goal is:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
I was able to do this in Excel using EOMONTH and cutting the string, but when I tried pd.to_datetime in Python, it didn't work. Thanks!
We can use MonthEnd:
from pandas.tseries.offsets import MonthEnd
df.Date=(pd.to_datetime(df.Date,format='%Y%m')+MonthEnd(1)).dt.strftime('%m/%d/%Y')
df
Out[1336]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
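If you want to keep a real datetime64 column instead of formatted strings, note that format='%Y%m' parses to the first day of the month, which MonthEnd(1) then rolls forward (a sketch; drop the strftime step):
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({'Id': [1, 2, 3], 'Name': ['A', 'B', 'C'],
                   'Date': ['201601', '201602', '201603'], 'Number': [5, 6, 4]})

# the parsed dates are the 1st of each month, so MonthEnd(1) moves
# them to the last day of the same month
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m') + MonthEnd(1)
print (df.dtypes)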
You can use PeriodIndex:
In [36]: df['Date'] = pd.PeriodIndex(df['Date'].astype(str), freq='M').strftime('%m/%d/%Y')
In [37]: df
Out[37]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4

pandas dataframe apply using additional arguments

With the below example:
df = pd.DataFrame({'signal': [1,0,0,1,0,0,0,0,1,0,0,1,0,0],
                   'product': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],
                   'price': [1,2,3,4,5,6,7,1,2,3,4,5,6,7],
                   'price2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals to Price if 'signal' is 1. Price_B equals previous row's Price_B if signal is 0. If the subgroup starts with a 0 'signal', then 'price_B' will be kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subgroups separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = (df.groupby('product')
                   .apply(lambda x: fill_price(x, 'signal', 'price2'))
                   .reset_index(level=0, drop=True))
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function:
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal==1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
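To cover both price columns in one pass, here is a sketch building on the same where/ffill idea (the '_B'-suffixed column names are just illustrative):
import pandas as pd

df = pd.DataFrame({'signal': [1,0,0,1,0,0,0,0,1,0,0,1,0,0],
                   'product': ['A']*7 + ['B']*7,
                   'price': [1,2,3,4,5,6,7,1,2,3,4,5,6,7],
                   'price2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2]})

for col in ['price', 'price2']:
    # c=col freezes the loop variable inside the lambda
    df[col + '_B'] = (df.groupby('product', group_keys=False)
                        .apply(lambda g, c=col: g[c].where(g['signal'] == 1)
                                                    .ffill().fillna(0)))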

Set value of first item in slice in python pandas

So I would like to make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:
import numpy
import pandas

df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0
The slice here is irrelevant, just for the example, and will return the whole data frame again. The point is that doing it as in the example raises a SettingWithCopyWarning (understandably). I have also tried slicing first and then using iloc/ix/loc, and using iloc twice, i.e. something like:
df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0
And neither of these works. Again, I don't want to make a copy of the dataframe, even if it is just the sliced version.
EDIT:
It seems there are two ways: using a mask or idxmax. The idxmax method seems to work if your index is unique, and the mask method if not. In my case the index is not unique, which I forgot to mention in the initial post.
I think you can use idxmax to get the index of the first True value and then set it with loc:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
print ((df[0] == 0).idxmax())
2
df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
0
0 1
1 3
2 100
3 0
4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
0
0 1
1 200
2 0
3 0
4 3
EDIT:
Solution for a non-unique index:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT1:
Solution with a MultiIndex:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df.index = [np.arange(len(df.index)), df.index]
print (df)
0
0 1 1
1 2 3
2 2 0
3 3 0
4 4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT2:
Solution with double cumsum:
np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
0
1 4
2 0
2 4
3 7
4 4
mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1 0
2 1
2 2
3 3
4 4
Name: 0, dtype: int32
df.loc[mask == 1, 0] = 200
print (df)
0
1 4
2 200
2 4
3 7
4 4
Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
print(df)
A
0 1
1 2
2 3
3 4
4 5
Create some arbitrary slice slc
slc = df[df.A > 2]
print(slc)
A
2 3
3 4
4 5
Access the first row of slc within df by using index[0] and loc
df.loc[slc.index[0]] = 0
print(df)
A
0 1
1 2
2 0
3 4
4 5
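The same pattern generalizes to any positional row of the slice, assuming the index labels are unique (n here is hypothetical):
n = 1                      # position within the slice
df.loc[slc.index[n]] = 0   # sets the row at original index 3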
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6,1), index=[1,2,2,3,3,3])
df[1] = 0
df.columns = ['a','b']
df.loc[df['a'] >= 0.5, 'b'] = 1
df = df.sort_values(['b','a'], ascending=[0,1])
df.loc[df[df['b']==0].index.tolist()[0], 'a'] = 0
In this method no extra copy of the dataframe is created, but an extra column is introduced, which can be dropped after processing. To choose any index instead of the first one, you can change the last line as follows:
df.loc[df[df['b']==0].index.tolist()[n], 'a'] = 0
to change the nth item in the slice.
df
a
1 0.111089
2 0.255633
2 0.332682
3 0.434527
3 0.730548
3 0.844724
df after slicing and labelling:
a b
1 0.111089 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
3 0.730548 1
3 0.844724 1
After changing the value of the first item in the slice (labelled 0) to 0:
a b
3 0.730548 1
3 0.844724 1
1 0.000000 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
So, using some of the answers, I managed to find a one-liner to do this:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
0
0 1
1 3
2 1
3 0
4 3
Essentially this is using the mask inline with a cumsum.
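One caveat, shown in a small sketch: cumsum() == 1 is also True for any non-zero rows sitting between the first and the second zero, so for safety you can AND it with the condition itself:
cond = df[0] == 0
# cumsum()==1 holds from the first zero up to (not including) the
# second zero, so AND it with cond to hit only the first zero itself
df.loc[cond & (cond.cumsum() == 1), 0] = 1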
