Pandas changing column value based on substring in other column - python

In Pandas I'm trying to fill in a Year column of a dataframe by checking an Age column that contains dates such as Mon Dec 28 11:19:42 CST 2007.
ID Age Year
1 Mon Dec 28 11:19:42 CST 2007 NaN
2 Tue Sep 28 12:39:41 CST 2008 NaN
I'm trying to do this with df.loc[df[df.Age.str.contains("2007")], 'Year'] = 2007, but this raises ValueError: cannot copy sequence with size 20 to array axis with dimension 11359
Expected result:
ID Age Year
1 Mon Dec 28 11:19:42 CST 2007 2007
2 Tue Sep 28 12:39:41 CST 2008 NaN
df[df['Age'].str.contains("2007")]['Year'] = 2007 also does not work. Can anyone show me how to do this properly?
Thanks in advance!

You can use str.endswith with loc:
df.loc[df.Age.str.endswith("2007"), 'Year'] = 2007
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
Or str.contains:
df.loc[df.Age.str.contains("2007"), 'Year'] = 2007
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
Another possible solution by mask:
df.Year = df.Year.mask(df.Age.str.endswith("2007"), 2007)
print (df)
ID Age Year
0 1 Mon Dec 28 11:19:42 CST 2007 2007.0
1 2 Tue Sep 28 12:39:41 CST 2008 NaN
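If, unlike the expected output above, every row's year should eventually be filled in, a single str.extract call avoids one loc assignment per year value. A minimal sketch on the sample data (the float cast mirrors the NaN-capable Year column):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2],
    'Age': ['Mon Dec 28 11:19:42 CST 2007', 'Tue Sep 28 12:39:41 CST 2008'],
})

# Pull the trailing four-digit year out of every row in one pass.
df['Year'] = df['Age'].str.extract(r'(\d{4})$', expand=False).astype(float)
```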

Related

Comparing two different column values for the records having same primary key value

I want to compare the two columns (date1 and date2) for the same ID and set the value of column NewColumn to 'YES' if date1 matches the previous row's date2.
INPUT:
ID Date1       date2       NewColumn
1  31 Jan 2022 1 Feb 2022
1  1 Feb 2022  2 Feb 2022
1  7 Feb 2022  8 Feb 2022
2  2 Feb 2022  2 Feb 2022
3  2 Feb 2022  3 Feb 2022
Input in CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022,
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
Output:
ID date1       date2       NewColumn
1  31 Jan 2022 1 Feb 2022
1  1 Feb 2022  2 Feb 2022  YES
1  7 Feb 2022  8 Feb 2022
2  2 Feb 2022  2 Feb 2022
3  2 Feb 2022  3 Feb 2022
In CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022, YES
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
You can use groupby and apply to run a custom function on each group. The function needs to compare date1 with the previous row's date2, which can be done using shift. This gives a boolean result (True or False); to turn it into a string value you can use np.where. For example:
import numpy as np
def func(x):
    return x['date1'] == x['date2'].shift(1)
df['NewColumn'] = np.where(df.groupby('ID').apply(func), 'YES', '')
Result:
ID date1 date2 NewColumn
0 1 31/01/2022 01/02/2022
1 1 01/02/2022 02/02/2022 YES
2 1 07/02/2022 08/02/2022
3 2 02/02/2022 02/02/2022
4 3 02/02/2022 03/02/2022
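A variant without apply: groupby plus shift returns a result aligned with the original row order, which sidesteps any index-alignment surprises when passing a grouped apply to np.where. A sketch on the CSV sample above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 3],
    'date1': ['31/01/2022', '01/02/2022', '07/02/2022', '02/02/2022', '02/02/2022'],
    'date2': ['01/02/2022', '02/02/2022', '08/02/2022', '02/02/2022', '03/02/2022'],
})

# Shift date2 down by one row within each ID group, keeping the
# original index, then compare row-wise against date1.
prev_date2 = df.groupby('ID')['date2'].shift(1)
df['NewColumn'] = np.where(df['date1'] == prev_date2, 'YES', '')
```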

Sorting grouped data in Pandas

I'm trying to sort grouped data using Pandas.
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How can I get the output sorted by month?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If the first solution is not possible, you can convert the months to datetimes and at the end convert them back:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
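A third option is an ordered Categorical for the month column, which keeps the default sorted groupby but in calendar order. A sketch on a subset of the data above (observed=True, assumed available in pandas >= 0.23, drops unused month categories):

```python
import pandas as pd

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

df = pd.DataFrame({
    'cust': ['astor'] * 4,
    'year': [2015, 2015, 2015, 2016],
    'month': ['Feb', 'Jan', 'Jan', 'Mar'],
    'price': [200, 100, 122, 169],
})

# An ordered Categorical makes the default sort=True groupby emit
# months chronologically instead of alphabetically.
df['month'] = pd.Categorical(df['month'], categories=month_order, ordered=True)
grouped = df.groupby(['cust', 'year', 'month'], observed=True)['price'].count()
```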

How to calculate bank statement debit/credit columns using balance column of pandas dataframe?

I have one dataframe which looks like below:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN NaN 1500
2 15 Dec 2017 15 Dec 2017 NaN NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN NaN 1700
4 21 Dec 2017 21 Dec 2017 NaN NaN 2000
5 22 Dec 2017 22 Dec 2017 NaN NaN 1000
In the above dataframe the "Bal" column contains balance values, and I want to fill in the DR/CR values based on the change from the previous "Bal" amount.
I did it using plain Python, but it seems pandas can perform this action in a more intelligent manner.
Expected Output:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN 500 1500
2 15 Dec 2017 15 Dec 2017 300 NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN 500 1700
4 21 Dec 2017 21 Dec 2017 NaN 300 2000
5 22 Dec 2017 22 Dec 2017 1000 NaN 1000
You could use Series.mask. First calculate the difference of the balance using diff. Then use mask to fill the DR column with the absolute difference where it is negative, and the CR column with the difference where it is positive.
diff = df['Bal'].diff()
df['DR'] = df['DR'].mask(diff < 0, diff.abs())
df['CR'] = df['CR'].mask(diff > 0, diff)
#Output
# Date_1 Date_2 DR CR Bal
#0 5 Dec 2017 5 Dec 2017 500.0 NaN 1000
#1 14 Dec 2017 14 Dec 2017 NaN 500.0 1500
#2 15 Dec 2017 15 Dec 2017 300.0 NaN 1200
#3 18 Dec 2017 18 Dec 2017 NaN 500.0 1700
#4 21 Dec 2017 21 Dec 2017 NaN 300.0 2000
#5 22 Dec 2017 22 Dec 2017 1000.0 NaN 1000
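The same split can be written with Series.where, the mirror image of mask: it keeps a value only where the condition holds and leaves NaN elsewhere. A minimal sketch on just the balance column (the first row has no previous balance, so it stays NaN, matching the question):

```python
import pandas as pd

bal = pd.Series([1000, 1500, 1200, 1700, 2000, 1000], name='Bal')
diff = bal.diff()

# A drop in balance is a debit, a rise is a credit.
dr = (-diff).where(diff < 0)  # debit amounts, NaN where balance rose
cr = diff.where(diff > 0)     # credit amounts, NaN where balance fell
```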

how to generate a unique service id number in python using dataframe

Hello guys, I have data with two columns and I want to generate a unique sequence of IDs for it...
This is data:
Year Month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
I want to join a service ID to these two columns, for which I have written this code:
data['Sr_ID'] = data.groupby(['Month','Year']).ngroup()
data.head()
This gives this output:
Year Month Sr_ID
0 2010 Jan 20
1 2010 Feb 15
2 2010 Mar 35
3 2010 Mar 35
4 2010 Mar 35
But I don't want "Sr_ID" values like this; I want them like "Sr_0001", "Sr_0002", ... a sequential number after the "Sr" prefix.
I want a output like this:
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
I want to generate a different ID for each row, because I have 8 columns with no repeated rows.
np.arange + str.zfill
You can use a range, then pad with zeros to the left:
import numpy as np

df['Sr_ID'] = 'Sr_' + pd.Series(np.arange(1, len(df.index)+1)).astype(str).str.zfill(4)
print(df)
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
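An equivalent without NumPy, using a plain f-string comprehension (a sketch on the sample data; note it numbers rows by position, not by index label):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2010] * 5,
                   'Month': ['Jan', 'Feb', 'Mar', 'Mar', 'Mar']})

# Format each 1-based position with zero padding to four digits.
df['Sr_ID'] = [f'Sr_{i:04d}' for i in range(1, len(df) + 1)]
```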

Working with dates in pandas

I have been collecting Twitter data for a couple of days now and, among other things, I need to analyze how content propagates. I created a list of timestamps when users were interested in content and imported twitter timestamps in pandas df with the column name 'timestamps'. It looks like this:
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
2 Sat Dec 14 05:23:10 +0000 2013
3 Sat Dec 14 05:27:54 +0000 2013
4 Sat Dec 14 05:37:43 +0000 2013
5 Sat Dec 14 05:39:38 +0000 2013
6 Sat Dec 14 05:41:39 +0000 2013
7 Sat Dec 14 05:43:46 +0000 2013
8 Sat Dec 14 05:44:50 +0000 2013
9 Sat Dec 14 05:47:33 +0000 2013
10 Sat Dec 14 05:49:29 +0000 2013
11 Sat Dec 14 05:55:03 +0000 2013
12 Sat Dec 14 05:59:09 +0000 2013
13 Sat Dec 14 05:59:45 +0000 2013
14 Sat Dec 14 06:17:19 +0000 2013
etc. What I want to do is to sample every 10min and count how many users are interested in content in each time frame. My problem is that I have no clue how to process the timestamps I imported from Twitter. Should I use regular expressions or is there any better approach to this? I would appreciate if someone could provide some pointers. Thanks!
That's Twitter's created_at format; it is not ISO 8601, but it can still be converted to datetime with pd.to_datetime:
>>> df[:2]
timestamp
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
>>> df['timestamp'] = pd.to_datetime(df['timestamp'])
>>> df[:2]
timestamp
0 2013-12-14 05:13:28
1 2013-12-14 05:21:12
To resample, you can make it the index and use resample:
>>> df.index = df['timestamp']
>>> df.resample('20Min', 'count')
2013-12-14 05:00:00 timestamp 1
2013-12-14 05:20:00 timestamp 5
2013-12-14 05:40:00 timestamp 8
2013-12-14 06:00:00 timestamp 1
dtype: int64
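The two-argument resample('20Min', 'count') form above is from older pandas; current versions chain the aggregation after resample, and the 10-minute window the question asked for works the same way. A sketch with an explicit format string (the %z directive keeps the UTC offset, so the parsed timestamps are timezone-aware):

```python
import pandas as pd

ts = pd.Series([
    'Sat Dec 14 05:13:28 +0000 2013',
    'Sat Dec 14 05:21:12 +0000 2013',
    'Sat Dec 14 05:39:38 +0000 2013',
], name='timestamp')

# Twitter's created_at layout: weekday, month, day, time, UTC offset, year.
parsed = pd.to_datetime(ts, format='%a %b %d %H:%M:%S %z %Y')

# Modern API: pick the frequency, then chain the aggregation.
counts = parsed.to_frame().set_index('timestamp').resample('10min').size()
```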
