Using Python in a Jupyter notebook, I have a DataFrame df with a column named "Month" (more than 100,000 rows) containing individual numbers up to 12. I want to create another column in the same data set, named "Quarters", that displays the quarter for each respective month.
I extracted the month from the "review_time" column using .dt.strftime('%m').
I am sorry if the provided information was not enough. I'm new to Stack Overflow.
So I extracted the month from the date column: I created a variable a and then added that variable a to the main table.
a = df['review_time'].dt.strftime('%m')
df.insert(2, "month",a, True)
This is the output of .info() for the month column:
<class 'pandas.core.series.Series'>
Int64Index: 965093 entries, 1 to 989508
Series name: month
Non-Null Count Dtype
-------------- -----
965093 non-null object
dtypes: object(1)
memory usage: 14.7+ MB
You could use pandas.cut
Example with a generic dataframe:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Month': [1,2,3,4,5,6,7,8,9,10,11,12]})
df['Quarter'] = pd.cut(df['Month'], [0,3,6,9,12], labels = [1,2,3,4])
print(df)
This prints:
Month Quarter
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
8 9 3
9 10 4
10 11 4
11 12 4
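One caveat for the asker's data: .dt.strftime('%m') returns strings, which is why month.info() reports dtype object, and pd.cut needs numbers. A minimal sketch (with made-up review_time values, purely for illustration) that casts to int first:

```python
import pandas as pd

# made-up review_time values just to demonstrate the cast
df = pd.DataFrame({'review_time': pd.to_datetime(['2022-01-15', '2022-07-04'])})

# strftime('%m') yields strings ('01'..'12'), so cast to int before binning
month = df['review_time'].dt.strftime('%m').astype(int)
df['Quarter'] = pd.cut(month, [0, 3, 6, 9, 12], labels=[1, 2, 3, 4])
```

Alternatively, df['review_time'].dt.month gives integers directly, with no string round-trip.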
An alternative is to calculate the quarter directly from the month number: qtr = (month - 1) // 3 + 1
import numpy as np
import pandas as pd
from datetime import datetime
# lo and hi used to generate random dates in 2022
lo = datetime( 2022, 1, 1 ).toordinal()
hi = datetime( 2022, 12, 31 ).toordinal()
np.random.seed( 1234 )
dates = [ datetime.fromordinal( np.random.randint( lo, hi )) for _ in range( 20 )]
df = pd.DataFrame( { 'Date': dates } )
df['Qtr'] = ( df['Date'].dt.month - 1 ) // 3 + 1
print( df )
Result
Date Qtr
0 2022-10-31 4
1 2022-07-31 3
2 2022-10-22 4
3 2022-02-23 1
4 2022-07-24 3
5 2022-06-02 2
6 2022-05-24 2
7 2022-06-27 2
8 2022-10-07 4
9 2022-08-22 3
10 2022-06-04 2
11 2022-01-31 1
12 2022-06-21 2
13 2022-06-08 2
14 2022-08-25 3
15 2022-10-10 4
16 2022-05-01 2
17 2022-11-22 4
18 2022-12-03 4
19 2022-09-04 3
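For completeness, pandas can also produce the quarter directly through the .dt accessor, so no arithmetic is needed once you have datetimes (the sample dates below are taken from the output above):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2022-02-23', '2022-07-24', '2022-10-31'])})
# Series.dt.quarter maps Jan-Mar to 1, Apr-Jun to 2, and so on
df['Qtr'] = df['Date'].dt.quarter
```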
Here is the data:
id  date    population
1   2021-5  21
2   2021-5  22
3   2021-5  23
4   2021-5  24
1   2021-4  17
2   2021-4  24
3   2021-4  18
4   2021-4  29
1   2021-3  20
2   2021-3  29
3   2021-3  17
4   2021-3  22
I want to calculate the monthly change in population for each id, so the result will be:
id  date  delta
1   5     .2353
1   4     -.15
2   5     -.1519
2   4     -.2083
3   5     .2174
3   4     .0556
4   5     -.2083
4   4     .3182
delta := (this month - last month) / last month
How should I approach this in pandas? I'm thinking of groupby but don't know what to do next.
Remember there might be more dates, but the result always has this shape.
Use GroupBy.pct_change with sorting by both columns first; last, remove missing rows by column delta:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','date'], ascending=[True, False])
df['delta'] = df.groupby('id')['population'].pct_change(-1)
df = df.dropna(subset=['delta'])
print (df)
id date population delta
0 1 2021-05-01 21 0.235294
4 1 2021-04-01 17 -0.150000
1 2 2021-05-01 22 -0.083333
5 2 2021-04-01 24 -0.172414
2 3 2021-05-01 23 0.277778
6 3 2021-04-01 18 0.058824
3 4 2021-05-01 24 -0.172414
7 4 2021-04-01 29 0.318182
Try this:
df.groupby('id')['population'].rolling(2).apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[0]).dropna()
Maybe you could try something like:
data['delta'] = data['population'].diff()
data['delta'] /= data['population']
With this approach the first row would be NaN, but for the rest this should work.
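Since the question has several ids, the difference should also be taken per group; a sketch of the same diff idea with groupby, using a few values from the question's table:

```python
import pandas as pd

# two months (May then April) for ids 1 and 2, taken from the question
data = pd.DataFrame({'id': [1, 1, 2, 2],
                     'population': [21, 17, 22, 24]})
grp = data.groupby('id')['population']
# (this month - last month) / last month; rows are sorted newest first,
# so "last month" is the next row within each id
data['delta'] = grp.diff(-1) / grp.shift(-1)
data = data.dropna(subset=['delta'])
```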
I have two tables like this:
Customr Issue Date_Issue
1 1 01/01/2019
1 2 03/06/2019
1 3 04/07/2019
1 4 13/09/2019
2 5 01/02/2019
2 6 16/03/2019
2 7 20/08/2019
2 8 30/08/2019
2 9 01/09/2019
3 10 01/02/2019
3 11 03/02/2019
3 12 05/03/2019
3 13 20/04/2019
3 14 25/04/2019
3 15 13/05/2019
3 16 20/05/2019
3 17 25/05/2019
3 18 01/06/2019
3 19 03/07/2019
3 20 20/08/2019
Customr Date_Survey df_Score
1 06/04/2019 10
2 10/06/2019 9
3 01/08/2019 3
And I need to obtain the number of issues for each customer in the three months before the date of the survey.
But I cannot get this query to work in pandas.
#first table
index_survey = [0,1,2]
Customer_Survey = pd.Series([1,2,3],index= index_survey)
Date_Survey = pd.Series(["06/04/2019","10/06/2019","01/08/2019"])
df_Score=[10, 9, 3]
df_survey = pd.DataFrame(Customer_Survey,columns = ["Customer_Survey"])
df_survey["Date_Survey"] =Date_Survey
df_survey["df_Score"] =df_Score
#And second table, built from the issues shown above
Customr = [1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3]
Issue = list(range(1, 21))
Date_Issue = ["01/01/2019","03/06/2019","04/07/2019","13/09/2019",
              "01/02/2019","16/03/2019","20/08/2019","30/08/2019","01/09/2019",
              "01/02/2019","03/02/2019","05/03/2019","20/04/2019","25/04/2019",
              "13/05/2019","20/05/2019","25/05/2019","01/06/2019","03/07/2019","20/08/2019"]
df_issues = pd.DataFrame({"Customr": Customr, "Issue": Issue, "Date_Issue": Date_Issue})
I expect the result
Custr Date_Survey Score Count_issues
1 06/04/2019 10 0
2 10/06/2019 9 1
3 01/08/2019 3 5
Use:
#convert columns to datetimes
df1['Date_Issue'] = pd.to_datetime(df1['Date_Issue'], dayfirst=True)
df2['Date_Survey'] = pd.to_datetime(df2['Date_Survey'], dayfirst=True)
#create datetimes for 3 months before
df2['Date1'] = df2['Date_Survey'] - pd.offsets.DateOffset(months=3)
#merge together
df = df1.merge(df2, on='Customr')
#filter by between, select only Customr and get counts
s = df.loc[df['Date_Issue'].between(df['Date1'], df['Date_Survey']), 'Customr'].value_counts()
#map to new column and replace NaNs to 0
df2['Count_issues'] = df2['Customr'].map(s).fillna(0, downcast='int')
print (df2)
Customr Date_Survey df_Score Date1 Count_issues
0 1 2019-04-06 10 2019-01-06 0
1 2 2019-06-10 9 2019-03-10 1
2 3 2019-08-01 3 2019-05-01 5
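A runnable sketch of the same approach with a trimmed subset of the question's issues (only a handful of rows are kept for brevity, so the counts differ from the full table):

```python
import pandas as pd

df1 = pd.DataFrame({'Customr': [1, 2, 2, 3, 3],
                    'Date_Issue': ['01/01/2019', '16/03/2019', '30/08/2019',
                                   '13/05/2019', '03/07/2019']})
df2 = pd.DataFrame({'Customr': [1, 2, 3],
                    'Date_Survey': ['06/04/2019', '10/06/2019', '01/08/2019']})

# convert columns to datetimes (day first, as in the question)
df1['Date_Issue'] = pd.to_datetime(df1['Date_Issue'], dayfirst=True)
df2['Date_Survey'] = pd.to_datetime(df2['Date_Survey'], dayfirst=True)
# start of the 3-month window before each survey
df2['Date1'] = df2['Date_Survey'] - pd.offsets.DateOffset(months=3)

df = df1.merge(df2, on='Customr')
s = df.loc[df['Date_Issue'].between(df['Date1'], df['Date_Survey']), 'Customr'].value_counts()
df2['Count_issues'] = df2['Customr'].map(s).fillna(0).astype(int)
```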
I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find out the number of days the user left as a gap. I want a gap value in each row for each user, so my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
I have loaded a pandas dataframe from a .csv file that contains a column having datetime values.
df = pd.read_csv('data.csv')
The name of the column holding the datetime values is pickup_datetime. Here's what I get if I do df['pickup_datetime'].head():
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]
How do I convert this column into a numpy array having only the day values of the datetime? For example: 15 from 0 2009-06-15 17:26:00+00:00, 05 from 1 2010-01-05 16:52:00+00:00, etc..
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
df['pickup_datetime'].dt.day.values
# array([15, 5, 18, 21, 9])
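The same idea as a self-contained sketch; to_numpy() is the newer spelling of .values, and utc=True matches the column's UTC-aware dtype:

```python
import pandas as pd

s = pd.Series(['2009-06-15 17:26:00+00:00', '2010-01-05 16:52:00+00:00'])
# parse as UTC-aware timestamps, then pull the day-of-month as a numpy array
days = pd.to_datetime(s, utc=True).dt.day.to_numpy()
```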
Just adding another variant, although coldspeed has already provided the brief answer, as a Christmas and New Year bonus :-) :
>>> df
pickup_datetime
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Convert the strings to timestamps by inferring their format:
>>> df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
>>> df
pickup_datetime
0 2009-06-15 17:26:00
1 2010-01-05 16:52:00
2 2011-08-18 00:35:00
3 2012-04-21 04:30:00
4 2010-03-09 07:51:00
You can pick only the day from pickup_datetime:
>>> df['pickup_datetime'].dt.day
0 15
1 5
2 18
3 21
4 9
Name: pickup_datetime, dtype: int64
You can pick only the month from pickup_datetime:
>>> df['pickup_datetime'].dt.month
0 6
1 1
2 8
3 4
4 3
You can pick only the year from pickup_datetime:
>>> df['pickup_datetime'].dt.year
0 2009
1 2010
2 2011
3 2012
4 2010
I have a dataframe that looks similar to the following:
df = pd.DataFrame({'Y_M':['201710','201711','201712'],'1':[1,5,9],'2':[2,6,10],'3':[3,7,11],'4':[4,8,12]})
df = df.set_index('Y_M')
Which creates a dataframe looking like this:
        1   2   3   4
Y_M
201710  1   2   3   4
201711  5   6   7   8
201712  9  10  11  12
The columns are the day of the month. They stretch on to the right, going all the way up to 31. (February will have columns 29, 30, and 31 filled with NaN).
The index contains the year and the month (e.g. 201711 referring to Nov 2017)
My question is: How can I make this a single series, with the year/month/day combined? My output would be the following:
Y_M
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
The index can be converted to a datetime. In fact I think it would make it easier.
Use stack to get a Series, then combine datetimes created by to_datetime with timedeltas created by to_timedelta:
df = df.stack()
df.index = pd.to_datetime(df.index.get_level_values(0), format='%Y%m') + \
pd.to_timedelta(df.index.get_level_values(1).astype(int) - 1, unit='D')
print (df)
2017-10-01 1
2017-10-02 2
2017-10-03 3
2017-10-04 4
2017-11-01 5
2017-11-02 6
2017-11-03 7
2017-11-04 8
2017-12-01 9
2017-12-02 10
2017-12-03 11
2017-12-04 12
dtype: int64
print (df.index)
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03', '2017-10-04',
'2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04',
'2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04'],
dtype='datetime64[ns]', freq=None)
Last if necessary strings in index (not DatetimeIndex) add DatetimeIndex.strftime:
df.index = df.index.strftime('%Y%m%d')
print (df)
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
print (df.index)
Index(['20171001', '20171002', '20171003', '20171004', '20171101', '20171102',
'20171103', '20171104', '20171201', '20171202', '20171203', '20171204'],
dtype='object')
Without bringing dates into it:
s = df.stack()
s.index = s.index.map('{0[0]}{0[1]:>02s}'.format)
s
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
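A runnable sketch of this formatting route with the question's dataframe; each stacked index entry is a (Y_M, day) tuple of strings, and the format spec zero-pads the day to two characters:

```python
import pandas as pd

df = pd.DataFrame({'Y_M': ['201710', '201711', '201712'],
                   '1': [1, 5, 9], '2': [2, 6, 10],
                   '3': [3, 7, 11], '4': [4, 8, 12]}).set_index('Y_M')
s = df.stack()
# '{0[0]}' is the year-month string, '{0[1]:0>2s}' the zero-padded day column label
s.index = s.index.map('{0[0]}{0[1]:0>2s}'.format)
```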