I would like to create a function to convert days to years.
In my data set, the age of my customers is given in days instead of years.
I do not have any other information about their age (no date of birth).
This is one of the solutions I have tried:
def convert(age_in_days):
    Age = int(age_in_days / 365)
    return Age
and then calling the function with
convert(data['age_in_days'])
my output is:
[id
1 31.041096
2 83.038356
3 44.024658
4 65.021918
5 53.041096
...
79849 70.013699
79850 46.019178
79851 68.041096
79852 30.024658
79853 54.027397
Name: age_in_days, Length: 79853, dtype: float64]
As you can see, I have a total of 79853 observations.
When I try to convert the result to a data frame so I can append it to my main data set,
pd.DataFrame(Age, columns=['age'])
this is what I get:
age
age_in_days NaN
And nothing else. I need to add the calculated age values to my data set.
I'm not sure there is any need to write a separate function to convert days to years; we can do the same directly in pandas.
Sample Input DF
df = pd.DataFrame({
    'id': [11, 12],
    'age_in_day': [22643, 10262]
})
id age_in_day
0 11 22643
1 12 10262
Two Options
Option 1 - Code - Pandas way
import numpy as np

# convert days to a timedelta, then divide by one numpy "year" (365.2425 days);
# note: newer pandas/numpy versions may warn about or disallow this division
df['age_in_years'] = pd.to_timedelta(df.age_in_day, unit='D') / np.timedelta64(1, 'Y')
Output
id age_in_day age_in_years
0 11 22643 61.994428
1 12 10262 28.096402
Option 2 - Code - float division (dividing by a flat 365 ignores leap years, hence the slightly different values from Option 1)
df['age_in_years'] = df['age_in_day'] / 365
df
Output
id age_in_day age_in_years
0 11 22643 62.035616
1 12 10262 28.115068
Other options
If we already have a separate list/Series of ages in years (generated by an external function) that we want to add to the existing df, then:
Series
0 62.035616
1 28.115068
Name: age_in_day, dtype: float64
df['age_in_years'] = age_Series
List
age_list=[62.035616438356165, 28.115068493150684]
df['age_in_years'] = age_list
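On the original error: pd.DataFrame(Age, columns=['age']) comes back as NaN because the columns argument of the DataFrame constructor selects matching columns rather than renaming them, and the returned Series is still named 'age_in_days', so nothing matches. A minimal fix, assuming data is the main DataFrame from the question:
# assign directly to the main DataFrame instead of building a new one
data['age'] = (data['age_in_days'] / 365).astype(int)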
Related
I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of trial-to-full license conversions, using the daily data and the User_Email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, then check the License_Type column. If the value in License_Type is 'full', return 1 in a new column called 'Conversion', else return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month, and I should have an aggregate value of monthly conversions in the 'Conversion' column. It should look something like below:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below was my attempt at getting the desired output above:
# attempt to create a new column Conversion and fill with 1 and 0 for if converted or not
for value in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0  # because there is no chance to go from trial to full
    else:
        if df['License_type'] == 'full':  # check if license type is full
            df['Conversion'] = 1  # if full, I assume it was originally trial and now is full

# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for. Rather than loop (always a pandas anti-pattern), have a simple function that operates row by row.
For the uniqueness test, I first get a count of uses of each email address and set the number of occurrences on each row.
Your logic I've transcribed in a slightly different way.
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
# strip the leading row numbers, then split each line on whitespace
lines = [re.sub(r"([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]
a = [[t.strip() for t in re.split(" ", l) if t.strip() != ""] for l in lines]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
    emailc=df.groupby("User_Email")["User_Email"].transform("count"),
    Conversion=lambda dfa: dfa.apply(
        lambda r: 0 if r["emailc"] == 1 or r["License_Type"] == "trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
output
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
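One caveat worth adding (my note, not part of the answer above): grouping on dt.strftime('%Y-%b') sorts the months alphabetically (Apr, Feb, Jan, ...), which is why the output above is ordered that way. A minimal sketch of grouping by a month period instead, which keeps chronological order:
# group by a monthly Period instead of a formatted string
df.groupby(df['Date'].dt.to_period('M'))['Conversion'].sum()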
I have a dataframe which looks like this
In []: df.head()
Out [] :
DATE NAME AMOUNT CURRENCY
2018-07-27 John 100 USD
2018-06-25 Jane 150 GBP
...
The contents under the DATE column are of date type.
I want to aggregate the data so I can see, for each day of the month, the count of transactions that happened on that day.
I also want to group it by year as well as day.
The end result I wanted would have looked something like this
YEAR DAY COUNT
2018 1 0
2 1
3 0
4 0
5 3
6 4
and so on
I used the following code, but the numbers are all wrong. Please help:
In []: df = pd.DataFrame({'DATE':pd.date_range(start=dt.datetime(2018,7,27),end=dt.datetime(2020,7,21))})
df.groupby([df['DATE'].dt.year, df['DATE'].dt.day]).agg({'count'})
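A minimal sketch of one way to get the desired shape (my suggestion, assuming DATE is already a datetime column): count rows per (year, day) with size(), then reindex against all days 1-31 so days with no transactions show a count of 0.
import pandas as pd

# count transactions per (year, day-of-month)
counts = df.groupby([df['DATE'].dt.year.rename('YEAR'),
                     df['DATE'].dt.day.rename('DAY')]).size()

# reindex against every (year, day) pair so missing days appear with count 0
full_idx = pd.MultiIndex.from_product(
    [counts.index.get_level_values('YEAR').unique(), range(1, 32)],
    names=['YEAR', 'DAY'])
print(counts.reindex(full_idx, fill_value=0).rename('COUNT'))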
I am trying to group hospital staff working hours bi-monthly (semi-monthly). I have raw data on a daily basis which looks like below.
date        hours_spent  emp_id
9/11/2016   8            1
15/11/2016  8            1
22/11/2016  8            2
23/11/2016  8            1
This is how I want it grouped:
cycle                  hours_spent  emp_id
1/11/2016-15/11/2016   16           1
16/11/2016-30/11/2016  8            2
16/11/2016-30/11/2016  8            1
I am trying to do this with a Grouper and a frequency in pandas, something like below.
data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
# df.resample('10d').mean().interpolate(method='linear', axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this gives data in 15-day intervals, not buckets like the 1st to the 15th and the 16th to the end of the month.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
The freq='SM' is semi-month end frequency, which anchors the groups on the 15th and the last day of every month.
Put DateTime-Values into Bins
If I got you right, you basically want to put your values in the date column into bins. For this, pandas has the pd.cut() function included, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [datetime(2016, 11, 9),  # datetime.datetime, since pd.datetime was removed in pandas 2.0
             datetime(2016, 11, 15),
             datetime(2016, 11, 22),
             datetime(2016, 11, 23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.
I have the following pandas data frame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
It is part of a much larger dataset that I need to be able to filter by month and year. I need to pull the fiscal year from the first two digits for values larger than 9999999 in the Shortcut_Dimension_4_Code column, and from the first digit for values less than or equal to 9999999. That value then needs to be appended to "20" to produce a year, i.e. "20" + "8" = 2008 | "20" + "10" = 2010.
That year "2008, 2010" needs to be combined with the stage code value (1-12) to produce a month/year, i.e. 02/2010.
The date 02/2010 then needs to be converted from a fiscal year date to a calendar year date, i.e. Fiscal Year date 02/2010 = Calendar Year date 08/2009. The resulting date needs to be presented in a new column. The resulting df would end up looking like this:
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
I am new to pandas and python and could use some help. I am beginning with this:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used .map and .str methods to produce this df, but have not been able to figure out how to get the fiscal years right for FY 2008-2009.
In the code below, I'll assume Shortcut_Dimension_4_Code is an integer. If it's a string, you can convert it or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. More explanations are in the comments alongside the code.
This should work as long as you don't have to deal with empty values.
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
fiscal_month_offset = 6
input_df = pd.DataFrame(
    [[10225003, 2],
     [8225003, 1],
     [8225004, 3],
     [8225005, 4]],
    columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
# make a copy of input dataframe to avoid modifying it
df = input_df.copy()
# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)
# loop with `apply` to create `date` objects from available columns
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)
# by default python dates will be saved as Object type in pandas. You can verify with `df.info()`
# to use the clever things pandas can do with dates we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])
# I would just keep date as datetime type so I could access year and month
# but to create same representation as in question, let's format it as string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')
# copy important columns into output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
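As a follow-up, the same fiscal-to-calendar shift can also be done without the row-wise apply. A vectorized sketch (my alternative, not the answer's method), reusing the fiscal_year column and fiscal_month_offset from above:
# assemble fiscal dates from the year/month columns, then shift back 6 months
fiscal = pd.to_datetime({'year': df['fiscal_year'], 'month': df['Stage_Code'], 'day': 1})
df['Date_vectorized'] = (fiscal - pd.DateOffset(months=fiscal_month_offset)).dt.strftime('%m/%Y')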
I've got order data with SKUs inside and would like to find out, how often a SKU has been bought per month over the last 3 years.
for row in df_skus.iterrows():
    df_filtered = df_orders.loc[df_orders['item_sku'] == row[1]['sku']]
    # Keep only the wanted columns:
    df_filtered = df_filtered[['txn_id', 'date', 'item_sku']].copy()
    # Group by year and month:
    df_result = df_filtered['date'].groupby([df_filtered.date.dt.year, df_filtered.date.dt.month]).agg('count')
    print(df_result)
    print(type(df_result))
The (shortened) result looks good so far:
date date
2017 3 1
Name: date, dtype: int64
date date
2017 2 1
3 6
4 1
6 1
Name: date, dtype: int64
Now, I'd like to create a CSV which looks like that:
SKU 2017-01 2017-02 2017-03
17 0 0 1
18 0 1 3
Is it possible to simply 'convert' my data into the desired structure?
I do these kinds of calculations all the time, and this seems to be the fastest approach.
import pandas as pd
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
monthly_sales = monthly_sales.unstack(1)  # unstack the date level so SKUs are rows and months are columns
monthly_sales.to_csv("my_csv.csv")
The first line filters to the SKUs you want.
The second line does a groupby and counts the number of sales per SKU per month.
The next line unstacks the multi-index into the format you want.
The last line exports to CSV.
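If you want the CSV to match the desired layout exactly, an optional finishing sketch (my assumption about the wanted formatting, not part of the answer above):
# format the month columns as YYYY-MM strings and show months with no sales as 0
monthly_sales.columns = monthly_sales.columns.strftime("%Y-%m")
monthly_sales = monthly_sales.fillna(0).astype(int)
monthly_sales.to_csv("my_csv.csv")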