Hello guys, I have data with two columns and I want to generate a unique sequence of IDs for it...
This is the data:
Year Month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
I want to join a service ID to these two columns, and for that I have written this code:
data['Sr_ID'] = data.groupby(['Month','Year']).ngroup()
data.head()
This gives the following output:
Year Month Sr_ID
0 2010 Jan 20
1 2010 Feb 15
2 2010 Mar 35
3 2010 Mar 35
4 2010 Mar 35
But I don't want "Sr_ID" like this; I want it to be "Sr_0001", "Sr_0002", and so on: the "Sr" prefix followed by a running sequence of numbers.
I want an output like this:
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
I want to generate a different ID for each row, because I have 8 columns with no repeated rows.
np.arange + str.zfill
You can use a range, then pad with zeros to the left:
df['Sr_ID'] = 'Sr_' + pd.Series(np.arange(1, len(df.index)+1)).astype(str).str.zfill(4)
print(df)
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
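A minimal alternative sketch that avoids the intermediate `pd.Series` (and any index-alignment surprises if the frame has a non-default index) is a plain f-string comprehension; the sample frame below mirrors the question's data:

```python
import pandas as pd

# sample frame as in the question
df = pd.DataFrame({'Year': [2010] * 5,
                   'Month': ['Jan', 'Feb', 'Mar', 'Mar', 'Mar']})

# f-string formatting pads the running counter to four digits (1 -> '0001');
# the list comprehension assigns by position, so the frame's index is irrelevant
df['Sr_ID'] = [f'Sr_{i:04d}' for i in range(1, len(df) + 1)]
print(df)
```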
Related
I want to compare the two columns (date1 and date2) for the same ID and set the value of column NewColumn to 'Yes' if date1 matches the previous row's date2.
INPUT:
ID   date1         date2         NewColumn
1    31 Jan 2022   1 Feb 2022
1    1 Feb 2022    2 Feb 2022
1    7 Feb 2022    8 Feb 2022
2    2 Feb 2022    2 Feb 2022
3    2 Feb 2022    3 Feb 2022
Input in CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022,
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
Output:
ID   date1         date2         NewColumn
1    31 Jan 2022   1 Feb 2022
1    1 Feb 2022    2 Feb 2022    YES
1    7 Feb 2022    8 Feb 2022
2    2 Feb 2022    2 Feb 2022
3    2 Feb 2022    3 Feb 2022
In CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022, YES
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
You can use groupby and apply to run a custom function on each group. The function needs to compare date1 with the previous row's date2, which can be done using shift. This gives a boolean value (True or False); to turn it into a string you can use np.where. For example:
import numpy as np

def func(x):
    return x['date1'] == x['date2'].shift(1)

df['NewColumn'] = np.where(df.groupby('ID').apply(func), 'YES', '')
Result:
ID date1 date2 NewColumn
0 1 31/01/2022 01/02/2022
1 1 01/02/2022 02/02/2022 YES
2 1 07/02/2022 08/02/2022
3 2 02/02/2022 02/02/2022
4 3 02/02/2022 03/02/2022
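A possible alternative sketch that avoids apply altogether: `groupby(...)['date2'].shift()` returns a Series aligned to the original index, so the comparison becomes one vectorized expression (column names as in the question, sample data rebuilt from the CSV):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3],
                   'date1': ['31/01/2022', '01/02/2022', '07/02/2022',
                             '02/02/2022', '02/02/2022'],
                   'date2': ['01/02/2022', '02/02/2022', '08/02/2022',
                             '02/02/2022', '03/02/2022']})

# the grouped shift is aligned to df's index, so no apply is needed and
# the first row of each ID group compares against NaN (never equal)
df['NewColumn'] = np.where(df['date1'] == df.groupby('ID')['date2'].shift(),
                           'YES', '')
print(df)
```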
I'm trying to sort grouped data using Pandas.
My code:
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How do I get the output sorted by month?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If the first solution is not possible, convert the months to datetimes and convert them back at the end:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
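Another possible sketch, assuming calendar ordering is what you want without the datetime round-trip: make month an ordered Categorical, so groupby sorts that level in calendar order (sample data abbreviated from the question):

```python
import pandas as pd

df = pd.DataFrame({'cust': ['astor'] * 3 + ['tor'] * 3,
                   'year': [2015, 2015, 2015, 2016, 2016, 2016],
                   'month': ['Jan', 'Jan', 'Feb', 'Mar', 'Apr', 'Apr'],
                   'price': [100, 122, 200, 234, 135, 169]})

order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['month'] = pd.Categorical(df['month'], categories=order, ordered=True)

# observed=True keeps unused month categories out of the result;
# sorting now follows the category order, not alphabetical order
grouped = df.groupby(['cust', 'year', 'month'], observed=True)['price'].count()
print(grouped)
```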
I have some data in a dataframe and want to check whether each Year is valid, i.e. whether it falls between start_year and end_year.
Year start_year end_year
2010 2012 2014
2013 2012 2015
2015 2015 2016
2009 2010 2012
2017 2016 2019
I want to add one more column (valid/invalid) specifying whether the Year is valid or not:
Year start_year end_year valid/invalid
2010 2012 2014 invalid
2013 2012 2015 valid
2015 2015 2016 valid
2009 2010 2012 invalid
2017 2016 2019 valid
How can we achieve this using Python?
You can use np.where with Series.between:
df["valid/invalid"] = np.where(df.Year.between(df.start_year,df.end_year),'valid','invalid')
df
Year start_year end_year valid/invalid
0 2010 2012 2014 invalid
1 2013 2012 2015 valid
2 2015 2015 2016 valid
3 2009 2010 2012 invalid
4 2017 2016 2019 valid
Check np.where:
df['v/inv'] = np.where((df.Year>=df.start_year) & (df.Year<=df.end_year), 'valid','invalid')
df
Year start_year end_year v/inv
0 2010 2012 2014 invalid
1 2013 2012 2015 valid
2 2015 2015 2016 valid
3 2009 2010 2012 invalid
4 2017 2016 2019 valid
If you want to stick to only using Pandas, then try the following solution, which uses apply and replace:
df['valid/invalid'] = df.apply(lambda x: (x.Year>=x.start_year) and (x.Year<=x.end_year), axis=1).replace({True:'Valid',False:'Invalid'})
Year start_year end_year valid/invalid
0 2010 2012 2014 Invalid
1 2013 2012 2015 Valid
2 2015 2015 2016 Valid
3 2009 2010 2012 Invalid
4 2017 2016 2019 Valid
The first apply step gets you True or False if the year is in between (inclusive on both ends) the start and end year. Second step replaces the True and False with Valid or Invalid strings.
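For completeness, a self-contained sketch of the Series.between approach on the sample data; note that between is inclusive on both ends by default, which is why 2015 counts as valid:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2010, 2013, 2015, 2009, 2017],
                   'start_year': [2012, 2012, 2015, 2010, 2016],
                   'end_year': [2014, 2015, 2016, 2012, 2019]})

# between(start, end) is inclusive on both ends, so boundary years are valid
df['valid/invalid'] = np.where(
    df['Year'].between(df['start_year'], df['end_year']),
    'valid', 'invalid')
print(df)
```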
I have a dataframe with a column that looks like this
Other via Other on 17 Jan 2019
Other via Other on 17 Jan 2019
Interview via E-mail on 14 Dec 2018
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Rejected via E-mail on 15 Jan 2019
Interview via E-mail on 14 Jan 2019
Rejected via Website on 12 Jan 2019
Is it possible to split this column into two: one with whatever comes before "via" and the other with whatever comes after "on"? Thank you!
Use str.extract:
df[['col1', 'col2']] = df.col.str.extract(r'(.*)\svia.*on\s(.*)', expand=True)
col1 col2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
You can pretty much use split() as df.col.str.split('via|on', expand=True)[[0,2]]:
Let's detail it out.
Reproducing Your DataFrame:
>>> df
col
0 Other via Other on 17 Jan 2019
1 Other via Other on 17 Jan 2019
2 Interview via E-mail on 14 Dec 2018
3 Rejected via E-mail on 15 Jan 2019
4 Rejected via E-mail on 15 Jan 2019
5 Rejected via E-mail on 15 Jan 2019
6 Rejected via E-mail on 15 Jan 2019
7 Interview via E-mail on 14 Jan 2019
8 Rejected via Website on 12 Jan 2019
Let's look at what happens here. First, splitting the whole column on the required strings via and on splits the column col into three separate columns 0, 1, 2, where 0 is everything before via, 2 is everything after on, and the middle column 1 is the part we don't need.
So we can simply opt for columns 0 and 2, as follows.
>>> df.col.str.split('via|on',expand=True)[[0,2]]
0 2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
Better to assign the result to a new dataframe and then rename the columns:
newdf = df.col.str.split('via|on',expand=True)[[0,2]]
newdf.rename(columns={0: 'col1', 2: 'col2'}, inplace=True)
print(newdf)
col1 col2
0 Other 17 Jan 2019
1 Other 17 Jan 2019
2 Interview 14 Dec 2018
3 Rejected 15 Jan 2019
4 Rejected 15 Jan 2019
5 Rejected 15 Jan 2019
6 Rejected 15 Jan 2019
7 Interview 14 Jan 2019
8 Rejected 12 Jan 2019
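One caveat worth noting about the split approach: the bare pattern via|on also matches those letter sequences inside words (a hypothetical status like "Confirmation" would split at "on"). A safer sketch anchors the separators with surrounding whitespace:

```python
import pandas as pd

df = pd.DataFrame({'col': ['Other via Other on 17 Jan 2019',
                           'Interview via E-mail on 14 Dec 2018']})

# multi-character patterns are treated as regex by str.split;
# requiring whitespace around 'via'/'on' avoids matching inside words
parts = df['col'].str.split(r'\s+via\s+|\s+on\s+', expand=True)
newdf = parts[[0, 2]].rename(columns={0: 'col1', 2: 'col2'})
print(newdf)
```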
In Pandas I'm trying to edit a column Year in a dataframe by checking the column Age, which contains dates such as Mon Dec 28 11:19:42 CST 2007.
ID Age Year
1 Mon Dec 28 11:19:42 CST 2007 NaN
2 Tue Sep 28 12:39:41 CST 2008 NaN
I'm trying to do this by using df.loc[df[df.Age.str.contains("2007")], 'Year'] = 2007; however, this returns the error ValueError: cannot copy sequence with size 20 to array axis with dimension 11359.
Expected result:
ID Age Year
1 Mon Dec 28 11:19:42 CST 2007 2007
2 Tue Sep 28 12:39:41 CST 2008 NaN
df[df['Age'].str.contains("2007")]['Year'] = 2007 also does not work. Can anyone help me out with how to do this properly?
Thanks in advance!
You can use str.endswith with loc:
df.loc[df.Age.str.endswith("2007"), 'Year'] = 2007
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
Or str.contains:
df.loc[df.Age.str.contains("2007"), 'Year'] = 2007
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
Another possible solution by mask:
df.Year = df.Year.mask(df.Age.str.endswith("2007"), 2007)
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
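If the goal is to fill Year for every row rather than one literal year at a time, a possible sketch (assuming the 4-digit year is always the last token of Age) uses str.extract with expand=False:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Age': ['Mon Dec 28 11:19:42 CST 2007',
                           'Tue Sep 28 12:39:41 CST 2008']})

# assumes the 4-digit year is always the final token of the string;
# expand=False makes extract return a Series rather than a DataFrame
df['Year'] = df['Age'].str.extract(r'(\d{4})$', expand=False).astype(int)
print(df)
```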