Change integer date format to pandas date format - python

I have the following dictionary:
{'date': {0: 20210101, 1: 20210102, 2: 20210103, 3: 20210104, 4: 20210105},
'users_with_pdp': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I would like these dates in the following format: d-m-yyyy.
How can this be done?
Thanks in advance.
I am trying to get this date format in pandas.

Update
From a dataframe:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d').dt.strftime('%-d-%-m-%Y')
print(df)
# Output
       date  users_with_pdp
0  1-1-2021               1
1  2-1-2021               2
2  3-1-2021               3
3  4-1-2021               4
4  5-1-2021               5
You can use pd.to_datetime:
# This is a dict not a dataframe!
df = {0: 20220101, 1: 20220102, 2: 20220103, 3: 20220104, 4: 20220105}
sr = pd.to_datetime(pd.Series(df), format='%Y%m%d')
print(sr)
# Output
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
4 2022-01-05
dtype: datetime64[ns]
Update
What if I want to adjust it to DD-MM-YYYY, or change a dd-mm-yyyy value to d-m-yyyy format?
sr = pd.to_datetime(pd.Series(df), format='%Y%m%d').dt.strftime('%-d-%-m-%Y')
print(sr)
# Output
0 1-1-2022
1 2-1-2022
2 3-1-2022
3 4-1-2022
4 5-1-2022
dtype: object
Caveat: you will lose the datetime64 dtype and get plain strings instead. Note also that the '%-d' / '%-m' directives (no leading zeros) work on Linux/macOS; on Windows use '%#d' / '%#m'.
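If you still need the dates for calculations, a possible compromise (just a sketch, not part of the answer above) is to keep the column as datetime64 and only render the d-m-yyyy text when displaying or exporting:
import pandas as pd

data = {'date': {0: 20210101, 1: 20210102, 2: 20210103, 3: 20210104, 4: 20210105},
        'users_with_pdp': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
df = pd.DataFrame(data)

# Keep a real datetime column for sorting, resampling, date arithmetic, ...
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

# ...and format as d-m-yyyy only at display/export time.
print(df['date'].dt.strftime('%-d-%-m-%Y'))  # on Windows use '%#d-%#m-%Y'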

Related

Count occurrences by group and add a column of percentages in python

My problem: Given a data frame I want to count occurrences of "is_bot" by "processed" as a grouping variable and then calculate the percentages using "is_bot" as the grouping variable.
My data: My data looks like this:
is_bot processed
0 No Bot 2
1 self_declared 3
2 self_declared 1
3 No Bot 3
4 No Bot 2
5 No Bot 1
6 No Bot 3
7 No Bot 1
8 No Bot 1
9 No Bot 2
So far: I have successfully calculated the occurrences. Then I managed to calculate the percentages. But I can't manage to do it within the same data frame.
foo = df.groupby(["processed", "is_bot"]).size()
foo
processed is_bot
1 No Bot 3
self_declared 1
2 No Bot 3
3 No Bot 2
self_declared 1
dtype: int64
foo.groupby("is_bot").transform(lambda x: 100*x/x.sum())
processed is_bot
1 No Bot 37.5
self_declared 50.0
2 No Bot 37.5
3 No Bot 25.0
self_declared 50.0
dtype: float64
My Data: This is the dict for my raw data:
df = {'is_bot': {0: 'No Bot', 1: 'self_declared', 2: 'self_declared', 3: 'No Bot', 4: 'No Bot', 5: 'No Bot', 6: 'No Bot', 7: 'No Bot', 8: 'No Bot', 9: 'No Bot'}, 'processed': {0: 2, 1: 3, 2: 1, 3: 3, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1, 9: 2}}
Extra: I am able to do it in R using dplyr:
df %>%
group_by(processed) %>%
count(is_bot) %>%
ungroup() %>%
group_by(is_bot) %>%
mutate(perc = n/sum(n)*100)
You need to group by both columns first to get the counts, then aggregate again by only one column. Here are possible alternative solutions:
df1 = (df.groupby(["processed", "is_bot"])
         .size()
         .div(df['is_bot'].value_counts(), level=1)
         .mul(100))
print (df1)
Or join both:
df1 = (df.groupby(["processed", "is_bot"]).size()
         .groupby("is_bot").transform(lambda x: 100*x/x.sum()))
Or use crosstab with the normalize parameter; note that it also fills non-matching combinations with 0, so replace those with NaN and reshape with DataFrame.stack (numpy imported as np):
df1 = (pd.crosstab(df['processed'], df['is_bot'], normalize=1)
         .replace(0, np.nan)
         .stack()
         .mul(100))
print (df1)
processed is_bot
1 No Bot 37.5
self_declared 50.0
2 No Bot 37.5
3 No Bot 25.0
self_declared 50.0
dtype: float64
For both columns use:
s = df.groupby(["processed", "is_bot"]).size()
s1 = s.groupby("is_bot").transform(lambda x: 100*x/x.sum())
df2 = pd.concat([s, s1], axis=1, keys=('count','perc'))
Or:
df2 = (df.groupby(["processed", "is_bot"])
         .size()
         .to_frame('count')
         .assign(perc=lambda x: x.groupby("is_bot")['count']
                                 .transform(lambda s: 100 * s / s.sum())))
print (df2)
count perc
processed is_bot
1 No Bot 3 37.5
self_declared 1 50.0
2 No Bot 3 37.5
3 No Bot 2 25.0
self_declared 1 50.0
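As a quick usage sketch (the snippets above assume df is already a DataFrame), you can build it from the dict in the question and run the last solution end to end:
import pandas as pd

data = {'is_bot': {0: 'No Bot', 1: 'self_declared', 2: 'self_declared', 3: 'No Bot', 4: 'No Bot',
                   5: 'No Bot', 6: 'No Bot', 7: 'No Bot', 8: 'No Bot', 9: 'No Bot'},
        'processed': {0: 2, 1: 3, 2: 1, 3: 3, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1, 9: 2}}
df = pd.DataFrame(data)

df2 = (df.groupby(["processed", "is_bot"])
         .size()
         .to_frame('count')
         .assign(perc=lambda x: x.groupby("is_bot")['count']
                                 .transform(lambda s: 100 * s / s.sum())))
print(df2)  # reproduces the count / perc table shown above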

Rolling difference of date column with groupby in pandas

I have the following dataframe
import pandas as pd
from pandas import Timestamp
foo = pd.DataFrame.from_dict(data={'id': {0: '1',
1: '1',
2: '1',
3: '2',
4: '2'},
'session': {0: 3, 1: 2, 2: 1, 3: 1, 4: 2},
'start_time': {0: Timestamp('2021-09-02 19:49:19'),
1: Timestamp('2021-09-16 10:54:21'),
2: Timestamp('2021-07-12 17:11:54'),
3: Timestamp('2021-03-02 01:53:22'),
4: Timestamp('2021-01-09 11:38:35')}})
I would like to add a new column to foo, called diff_start_time, which would be the difference of the start_time column of the current session from the previous one, grouped by id. I would like the difference to be in hours.
How could I do that in python ?
Use DataFrameGroupBy.diff with Series.dt.total_seconds:
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
0 1 3 2021-09-02 19:49:19 NaN
1 1 2 2021-09-16 10:54:21 327.083889
2 1 1 2021-07-12 17:11:54 -1577.707500
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
If necessary, sort first by id and session:
foo = foo.sort_values(['id','session'])
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
You can use .groupby() + diff() + dt.total_seconds() to get the difference in seconds, then divide by 3600 to convert it to hours.
df_out = foo.sort_values(['id', 'session'])
df_out['diff_start_time'] = df_out.groupby('id')['start_time'].diff().dt.total_seconds() / 3600
Result:
print(df_out)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
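For reference, here is a small sketch of the same calculation spelled out with shift(); the grouped diff() is simply the difference from the previous row within each id group:
import pandas as pd
from pandas import Timestamp

foo = pd.DataFrame({'id': ['1', '1', '1', '2', '2'],
                    'session': [3, 2, 1, 1, 2],
                    'start_time': [Timestamp('2021-09-02 19:49:19'),
                                   Timestamp('2021-09-16 10:54:21'),
                                   Timestamp('2021-07-12 17:11:54'),
                                   Timestamp('2021-03-02 01:53:22'),
                                   Timestamp('2021-01-09 11:38:35')]})

foo = foo.sort_values(['id', 'session'])
prev_start = foo.groupby('id')['start_time'].shift()  # previous session's start per id
foo['diff_start_time'] = (foo['start_time'] - prev_start).dt.total_seconds() / 3600
print(foo)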

In a pandas dataframe, how can I set the value of other columns based on the data from one column, without using a loop?

I'm trying to build a dataframe that will be used for linear regression. I would like to include 11 independent "dummy" variables that are set to either 1 or 0 based on the month of the year. Without getting too far off topic, I'm using 11 variables instead of 12, as the 12th month is captured by the intercept.
I know many things can be done with pandas without looping through the entire dataframe, and doing things that way is typically faster than using a loop.
So, is it possible to grab the month from my date column and dynamically set a separate column to either 1 or 0 based on that month? Or am I asking a stupid question?
Edit: I should have included more information.
A dataframe is structured like this:
Date        sku     units ordered  sessions  conversion rate
2020/01/30  abc123  20             200       0.1
2020/01/31  abc123  10             100       0.1
2020/02/01  abc123  15             60        0.25
I would like to make it look like this:
Date        sku     units ordered  sessions  conversion rate  january  february
2020/01/30  abc123  20             200       0.1              1        0
2020/01/31  abc123  10             100       0.1              1        0
2020/02/01  abc123  15             60        0.25             0        1
The code I'm currently using to accomplish this is:
x = 1
while x < 12:
    month = calendar.month_name[x]
    df[month] = 0
    x += 1

for index, row in df.iterrows():
    d = row[0]
    month = d.strftime("%B")
    if not month == "December":
        df.at[index, month] = 1

df.fillna(0, inplace=True)
Just not sure if this is the best way to accomplish this.
My approach would be to first get the month number from every month using dt.month:
df['Date'].dt.month
0 1
1 1
2 2
Name: Date, dtype: int64
Then use crosstab with the index to get the tabulation of the counts:
pd.crosstab(
df.index,
df['Date'].dt.month
)
Date 1 2
row_0
0 1 0
1 1 0
2 0 1
Then merge back to the DF on index:
df = df.merge(pd.crosstab(df.index, df['Date'].dt.month),
              left_index=True, right_index=True)
Output:
Date sku units ordered sessions conversion rate 1 2
0 2020-01-30 abc123 20 200 0.10 1 0
1 2020-01-31 abc123 10 100 0.10 1 0
2 2020-02-01 abc123 15 60 0.25 0 1
Finally, rename the columns using a mapper generated with the calendar api:
df = df.rename(columns={month_num: calendar.month_name[month_num]
for month_num in range(1, 13)})
All together:
import pandas as pd
import calendar

df = pd.DataFrame(
    {'Date': {0: '2020/01/30', 1: '2020/01/31', 2: '2020/02/01'},
     'sku': {0: 'abc123', 1: 'abc123', 2: 'abc123'},
     'units ordered': {0: 20, 1: 10, 2: 15},
     'sessions': {0: 200, 1: 100, 2: 60},
     'conversion rate': {0: 0.1, 1: 0.1, 2: 0.25}})
df['Date'] = df['Date'].astype('datetime64[ns]')

df = df.merge(pd.crosstab(df.index, df['Date'].dt.month),
              left_index=True, right_index=True)

df = df.rename(columns={month_num: calendar.month_name[month_num]
                        for month_num in range(1, 13)})
print(df.to_string())
print(df.to_string())
Output:
Date sku units ordered sessions conversion rate January February
0 2020-01-30 abc123 20 200 0.10 1 0
1 2020-01-31 abc123 10 100 0.10 1 0
2 2020-02-01 abc123 15 60 0.25 0 1
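A shorter route worth considering (my own sketch, not part of the answer above) is pd.get_dummies on the month name. Note that get_dummies only creates columns for months that actually occur in the data, so reindex against the full month list if the regression needs all eleven dummies:
import calendar
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2020/01/30', '2020/01/31', '2020/02/01'])})

months = df['Date'].dt.month_name().str.lower()           # 'january', 'february', ...
wanted = [m.lower() for m in calendar.month_name[1:12]]   # January..November; December is the intercept
dummies = (pd.get_dummies(months)
             .reindex(columns=wanted, fill_value=0)       # add months missing from the data as 0
             .astype(int))
df = df.join(dummies)
print(df)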

Two date columns to produce new series with selected dates

I have a DataFrame with two date columns, each row corresponding to a disjoint interval of time. I am trying to produce a series which contains as an index all dates from the minimum date to the maximum date from the original columns and has a value 1 if it is a date within one of the original time intervals.
pd.DataFrame({"A":[pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
"B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})
id A B
0 2017-01-01 2017-01-03
1 2017-02-01 2017-02-03
To this,
pd.DataFrame({"A":[pd.Timestamp("2017-1-1"),pd.Timestamp("2017-1-2"),pd.Timestamp("2017-1-3"),
pd.Timestamp("2017-2-1"),pd.Timestamp("2017-2-2"),pd.Timestamp("2017-2-3")],
"B": [1,1,1,1,1,1]})
id A B
0 2017-01-01 1
1 2017-01-02 1
2 2017-01-03 1
3 2017-02-01 1
4 2017-02-02 1
5 2017-02-03 1
Not really pythonic but I think it solves your issue:
In [1]:
from datetime import date, timedelta
import pandas as pd
df = pd.DataFrame({"A":[pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
"B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})
dates_list = []
for k in range(len(df)):
    sdate = df.iloc[k, 0]  # start date
    edate = df.iloc[k, 1]  # end date
    delta = edate - sdate  # as timedelta
    for i in range(delta.days + 1):
        day = sdate + timedelta(days=i)
        dates_list.append(day)

final = pd.DataFrame(data=dates_list, columns=['A'])
final['B'] = 1
final
Out [1]:
A B
0 2017-01-01 1
1 2017-01-02 1
2 2017-01-03 1
3 2017-02-01 1
4 2017-02-02 1
5 2017-02-03 1
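A more pandas-idiomatic alternative (a sketch along the same lines) builds one inclusive date_range per row and concatenates them:
import pandas as pd

df = pd.DataFrame({"A": [pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
                   "B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})

# One daily range per interval (both endpoints included), then one flat Series of all dates.
all_days = pd.concat([pd.Series(pd.date_range(a, b, freq='D'))
                      for a, b in zip(df["A"], df["B"])], ignore_index=True)
final = pd.DataFrame({"A": all_days, "B": 1})
print(final)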

Pandas Dataframe Groupby Determine Values in 1 group vs. another group

I have a dataframe as follows:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
What I would like to do is determine the ID(s) on one date that are exclusive to that date versus the values of another date.
Example1: The result df would be the exclusive ID(s) in 2014-12-31 vs. the ID(s) in 2015-01-01 and the exclusive ID(s) in 2015-01-01 vs. the ID(s) in 2015-01-02:
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 5
I would like to 'choose' how many days 'back' I compare. For instance I can enter a variable daysback=1 and each day would compare to the previous. Or I can enter variable daysback=2 and each day would compare to two days ago. etc.
Outside of df.groupby('Date'), I'm not sure where to go with this. Possibly use of diff()?
I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.
If those assumptions are wrong, then that changes things a bit.
import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    date_new = date
    date_old = date - timedelta(days=daysback)
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    return df.iloc[ids_new[~ids_new.isin(ids_old)]]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running that produces the following output:
Date ID
7 2015-01-01 1
9 2015-01-01 3
If the Date is your Index field, then you need to modify two lines in the function:
ids_new = df.loc[date_new]['ID']
ids_old = df.loc[date_old]['ID']
Output:
ID
Date
2015-01-01 1
2015-01-01 3
EDIT:
This is kind of dirty, but it should accomplish what you want to do. I added comments inline that explain what is going on. There are probably cleaner and more efficient ways to go about this if this is something that you're going to be running regularly or across massive amounts of data.
def find_unique_ids(df, daysback):
    # We need both Date and ID to both be either fields or index fields -- no mix/match.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back on to itself, SQL style LEFT OUTER.
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Create series of missing_id values from the right table
    missing_ids = (df2['Date_y'].isnull())
    # Create series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value will be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Use those to find missing IDs on valid dates. Create a new output DataFrame.
    output = df2[(valid_dates) & (missing_ids)][['DateComp_x', 'ID']]
    # Rename columns of output and return
    output.columns = ['Date', 'ID']
    return output
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
EDIT:
missing_ids = df2[df2['Date_y'].isnull()]  # gives the whole necessary DataFrame
Another way, by applying list as the aggregation:
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
As a function:
def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))
uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]
You could also wrap this in a function and use date objects instead of strings. :)
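Building on that suggestion, here is a possible sketch (mine, not from the answers above) that works on datetime values directly and takes daysback as a parameter, assuming the columns are named Date (datetime) and ID as in the question:
import pandas as pd

def exclusive_ids(df, daysback=1):
    # Set of IDs seen on each date.
    ids_by_date = df.groupby('Date')['ID'].apply(set)
    rows = []
    for date, ids in ids_by_date.items():
        ref_date = date - pd.Timedelta(days=daysback)
        if ref_date in ids_by_date.index:
            # IDs present `daysback` days earlier but missing on `date`.
            rows += [(date, i) for i in sorted(ids_by_date[ref_date] - ids)]
    return pd.DataFrame(rows, columns=['Date', 'ID'])
With daysback=1 this reproduces the expected output above (6, 7 under 2015-01-01 and 2, 4, 5 under 2015-01-02).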
