Alter the content of a pandas column by regular expression - python

I have a dataframe with a column that looks like this
Other via Other on 17 Jan 2019
Other via Other on 17 Jan 2019
Interview via E-mail on 14 Dec 2018
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Interview via E-mail on 14 Jan 2019
Rejected via Website on 12 Jan 2019
Is it possible to split this column into two, one is whatever before the "via" and the other is whatever after the "on"? Thank you!

Use str.extract
df[['col1', 'col2']] = df.col.str.extract(r'(.*)\svia.*on\s(.*)', expand=True)
col1 col2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
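A self-contained sketch of the same idea, using named capture groups so the regex labels the result columns directly (the r'' prefix also avoids escape-sequence warnings):

```python
import pandas as pd

df = pd.DataFrame({'col': ['Other via Other on 17 Jan 2019',
                           'Interview via E-mail on 14 Dec 2018']})

# named groups (?P<name>...) become the column labels of the extracted frame
out = df['col'].str.extract(r'(?P<col1>.*)\svia.*on\s(?P<col2>.*)')
```

With named groups there is no need to assign column names separately afterwards.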

You can pretty much use split(), as in df.col.str.split('via|on', expand=True)[[0, 2]].
Let's walk through it:
Reproducing Your DataFrame:
>>> df
col
0 Other via Other on 17 Jan 2019
1 Other via Other on 17 Jan 2019
2 Interview via E-mail on 14 Dec 2018
3 Rejected via E-mail on 15 Jan 2019
4 Rejected via E-mail on 15 Jan 2019
5 Rejected via E-mail on 15 Jan 2019
6 Rejected via E-mail on 15 Jan 2019
7 Interview via E-mail on 14 Jan 2019
8 Rejected via Website on 12 Jan 2019
Let's look at what happens here. First we split the whole column on the required strings via and on, which splits col into three separate columns 0, 1, 2: column 0 holds everything before via, column 2 holds everything after on, and column 1 holds the middle part, which we don't need.
So we can keep only columns 0 and 2, as follows.
>>> df.col.str.split('via|on',expand=True)[[0,2]]
0 2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
Better to assign the result to a new dataframe and then rename the columns:
Result:
newdf = df.col.str.split('via|on',expand=True)[[0,2]]
newdf.rename(columns={0: 'col1', 2: 'col2'}, inplace=True)
print(newdf)
col1 col2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
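Note that splitting on 'via|on' alone leaves the surrounding spaces attached to each piece. A variant that folds the whitespace into the delimiters so the pieces come out trimmed (the explicit regex=True flag assumes pandas >= 1.4):

```python
import pandas as pd

df = pd.DataFrame({'col': ['Other via Other on 17 Jan 2019',
                           'Interview via E-mail on 14 Dec 2018']})

# include the surrounding whitespace in the delimiters so no stripping is needed
newdf = (df['col'].str.split(r'\s+via\s+|\s+on\s+', expand=True, regex=True)[[0, 2]]
                  .rename(columns={0: 'col1', 2: 'col2'}))
```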

Related

How do I parse value as datetime?

I have dates as below:
date
0 Today, 12 Mar
1 Tomorrow, 13 Mar
2 Tomorrow, 13 Mar
3 Tomorrow, 13 Mar
4 Tomorrow, 13 Mar
5 14 Mar 2021
6 14 Mar 2021
7 14 Mar 2021
8 14 Mar 2021
9 15 Mar 2021
How do I parse it as datetime in pandas?
Your date contains 'Today' and 'Tomorrow', which are not valid datetime tokens, so first replace them with the year (assuming the year is fixed, i.e. 2021):
df['date']=df['date'].str.replace('Today','2021')
df['date']=df['date'].str.replace('Tomorrow','2021')
Now just use the to_datetime() method:
df['date']=pd.to_datetime(df['date'])
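One caveat: the replacement above produces strings like "2021, 12 Mar" alongside "14 Mar 2021", and newer pandas versions (2.0+) infer a single format from the first value and can reject mixed formats. A sketch that normalizes everything to one "DD Mon YYYY" format first (the fixed year 2021 is still an assumption) sidesteps that:

```python
import pandas as pd

s = pd.Series(['Today, 12 Mar', 'Tomorrow, 13 Mar', '14 Mar 2021'])

# rewrite "Today, 12 Mar" / "Tomorrow, 13 Mar" as "12 Mar 2021" so every
# entry shares the same "%d %b %Y" format; full dates pass through unchanged
norm = s.str.replace(r'^(?:Today|Tomorrow),\s*(\d{1,2} \w{3})$', r'\1 2021',
                     regex=True)
dates = pd.to_datetime(norm, format='%d %b %Y')
```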

Sorting grouped data in Pandas

I'm trying to sort a grouped data using Pandas
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How to get output sorted by month?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If the first solution is not possible to use, convert the months to datetimes and convert them back at the end:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
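If chronological order is needed regardless of the original row order, another common approach (not shown in the answer above) is to make month an ordered categorical, so groupby's default sorting follows calendar order instead of alphabetical order. A sketch on a trimmed copy of the data:

```python
import pandas as pd
from io import StringIO

data = """cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200"""
df = pd.read_csv(StringIO(data))

# an ordered categorical makes groupby sort months chronologically
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['month'] = pd.Categorical(df['month'], categories=months, ordered=True)

# observed=True keeps only month values that actually occur in the data
grouped = df.groupby(['cust', 'year', 'month'], observed=True)['price'].count()
```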

Elementwise multiplication of pandas Dataframes with different indices

I have two pandas dataframes:
df1=pd.DataFrame({'month':['jun', 'jul', 'aug'],'a':[3,4,5], 'b':[2,3,4], 'c':[4,5,5]}).set_index('month')
a b c
month
jun 3 2 4
jul 4 3 5
aug 5 4 5
and
df2=pd.DataFrame({'year':[2009,2009,2009, 2010,2010,2010,2011,2011,2011],'month':['jun', 'jul', 'aug','jun', 'jul', 'aug','jun', 'jul', 'aug'],'a':[2,2,2,2,2,2,2,2,2], 'b':[1,2,3,4,5,6,7,8,9], 'c':[3,3,3,3,3,3,3,3,3]}).set_index('year')
month a b c
year
2009 jun 2 1 3
2009 jul 2 2 3
2009 aug 2 3 3
2010 jun 2 4 3
2010 jul 2 5 3
2010 aug 2 6 3
2011 jun 2 7 3
2011 jul 2 8 3
2011 aug 2 9 3
I would like to multiply df2's elements with df1's according to the months. Is there a quick way to do it?
Thanks in advance.
Use DataFrame.mul, with months converted to a MultiIndex level by DataFrame.set_index:
df = df2.set_index('month', append=True).mul(df1, level=1).reset_index(level=1)
print (df)
month a b c
year
2009 jun 6 2 12
2009 jul 8 6 15
2009 aug 10 12 15
2010 jun 6 8 12
2010 jul 8 15 15
2010 aug 10 24 15
2011 jun 6 14 12
2011 jul 8 24 15
2011 aug 10 36 15
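For reference, a self-contained version of the same alignment on a trimmed copy of the data (2009 only):

```python
import pandas as pd

df1 = pd.DataFrame({'month': ['jun', 'jul', 'aug'],
                    'a': [3, 4, 5], 'b': [2, 3, 4],
                    'c': [4, 5, 5]}).set_index('month')
df2 = pd.DataFrame({'year': [2009, 2009, 2009],
                    'month': ['jun', 'jul', 'aug'],
                    'a': [2, 2, 2], 'b': [1, 2, 3],
                    'c': [3, 3, 3]}).set_index('year')

# append month to the row index so mul() can align with df1 on that level,
# then move month back out to a regular column
out = (df2.set_index('month', append=True)
          .mul(df1, level=1)
          .reset_index(level=1))
```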

how to generate a unique service id number in python using dataframe

Hello guys, I have data with two columns and I want to generate a unique sequence of IDs for it...
This is data:
Year Month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
I want to join a service ID to these two columns; for that I have written this code:
data['Sr_ID'] = data.groupby(['Month','Year']).ngroup()
data.head()
which gives this output:
Year Month Sr_ID
0 2010 Jan 20
1 2010 Feb 15
2 2010 Mar 35
3 2010 Mar 35
4 2010 Mar 35
but I don't want "Sr_ID" like this; I want it to be a running sequence like "Sr_0001", "Sr_0002", and so on.
I want output like this:
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
I want to generate a different ID for each row, because I have 8 columns with no repeated rows.
np.arange + str.zfill
You can use a range, then pad with zeros to the left:
df['Sr_ID'] = 'Sr_' + pd.Series(np.arange(1, len(df.index)+1)).astype(str).str.zfill(4)
print(df)
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
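One caveat with the pd.Series above: it is built with a fresh 0-based index, so the assignment only lines up if df also has a default RangeIndex. A sketch that sidesteps index alignment entirely by assigning a plain list (which pandas takes positionally):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2010] * 5,
                   'Month': ['Jan', 'Feb', 'Mar', 'Mar', 'Mar']})

# a list assigns by position, so this works for any index, not just 0..n-1
df['Sr_ID'] = [f'Sr_{i:04d}' for i in range(1, len(df) + 1)]
```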

Percentage change with groupby python

I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398
Do not group by Year, otherwise you won't get, for instance, Aug 2017 and Aug 2016 in the same group. Also, use transform to broadcast the results back to the original indices.
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398
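A runnable sketch of the transform approach on a slice of the data above; it assumes, as in the question, that within each month the rows appear in year order:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2017, 2016, 2017],
                   'Month': ['Aug', 'Aug', 'Jul', 'Jul'],
                   'Booked': [55999.0, 60862.0, 54062.0, 58417.0]})

# group only by Month so each month's 2016 and 2017 rows land in one group;
# transform keeps the result aligned with the original row order
df['pct_ch'] = df.groupby('Month')['Booked'].transform(lambda s: s.pct_change())
```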
