Extract monthly categorical (dummy) variables in pandas from a time series - python

So I have a dataframe (df) with dated data on a monthly time series (end of the month). It looks something like this:
Date Data
2010-01-31 625000
2010-02-28 750000
...
2014-10-31 450000
2014-11-30 475000
I would like to check on seasonal monthly effects.
This is probably simple to do, but how can I go about extracting the month from Date to create categorical dummy variables for use in a regression?
I want it to look something like this:
Date 01 02 03 04 05 06 07 08 09 10 11
2010-01-31 1 0 0 0 0 0 0 0 0 0 0
2010-02-28 0 1 0 0 0 0 0 0 0 0 0
...
2014-10-31 0 0 0 0 0 0 0 0 0 1 0
2014-11-30 0 0 0 0 0 0 0 0 0 0 1
I tried using pd.DataFrame(df.index.month, index=df.index), which gives me the month for each date. I believe I then need to use pd.get_dummies to get the variables in a 0/1 matrix format. Can someone show me how? Thanks.

This is how I got April:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=4, freq='MS')
df = pd.DataFrame(np.random.randn(4), index=dates, columns=['data'])
df.loc[dates.month == 4]
The idea is to make the dates your index and then do boolean index selection on the dataframe.
>>> df
data
2013-01-01 0.141205
2013-02-01 0.115361
2013-03-01 -0.309521
2013-04-01 -0.236317
>>> df.loc[dates.month == 4]
data
2013-04-01 -0.236317
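The boolean selection above pulls out a single month; for the 0/1 dummy matrix the question actually asks for, pd.get_dummies applied to the index's month numbers works. A minimal sketch, with toy data and an `m_` column prefix of my own choosing:

```python
import pandas as pd

# End-of-month dates as in the question, with toy data
dates = pd.to_datetime(['2010-01-31', '2010-02-28', '2010-03-31', '2010-10-31'])
df = pd.DataFrame({'Data': [625000, 750000, 700000, 450000]}, index=dates)

# One 0/1 column per month number present in the index
dummies = pd.get_dummies(df.index.month, prefix='m').astype(int)
dummies.index = df.index

out = df.join(dummies)
```

For a regression with an intercept, passing drop_first=True to pd.get_dummies avoids the dummy-variable trap by dropping one month column.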

Related

Convert binary to dates Pandas Python

I have a binary string like this:
0001111000011111111111110001011011000000000011111100000111110
I want to map it to the range of dates from 01/10/2021 to 30/11/2021, where each digit in the string corresponds to one date.
The value 1 represents a day out and the value 0 represents a day at home.
So output:
Day
Code
01/10/2021
0
02/10/2021
0
03/10/2021
0
04/10/2021
1
....
....
30/11/2021
0
How can I do this? Thanks for the help!
Build your dataframe like this:
import pandas as pd

code = '0001111000011111111111110001011011000000000011111100000111110'
start_date = '2021-10-01'
df = pd.DataFrame({'Day': pd.date_range(start_date, periods=len(code), freq='D'),
                   'Code': list(code)})
Output:
>>> df
Day Code
0 2021-10-01 0
1 2021-10-02 0
2 2021-10-03 0
3 2021-10-04 1
4 2021-10-05 1
.. ... ...
56 2021-11-26 1
57 2021-11-27 1
58 2021-11-28 1
59 2021-11-29 1
60 2021-11-30 0
[61 rows x 2 columns]
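Note that list(code) leaves the Code column holding the strings '0'/'1'; if integer codes are wanted (an assumption on my part), a cast does it:

```python
import pandas as pd

code = '0001111000011111111111110001011011000000000011111100000111110'
df = pd.DataFrame({'Day': pd.date_range('2021-10-01', periods=len(code), freq='D'),
                   'Code': list(code)})
df['Code'] = df['Code'].astype(int)  # '0'/'1' strings -> 0/1 integers
```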

Is there a python function to shift revenue column by specific timedelta?

Please refer to the image attached
I have a data frame with yearly revenue in columns (2020 to 2025). I want to shift the revenue in those columns by a given time delta (column Time Shift). The time delta I have is in days. Is there an efficient way to make the shift?
E.g., what I want to achieve is to shift the yearly revenue between columns by the number of days in the Time Shift column, i.e. 4 days' worth of revenue moves from each column to the next (so 1.27 [115.98/365 * 4] should be shifted from 2022 to 2023 for the first row).
Thanks in advance.
Text Input data
Launch Date Launch Date Base Time Shift 2020 2021 2022 2023 2024 2025
2022-06-01 2022-06-01 4 0 0 115.98 122.93 119.22 35.31
2025-02-01 2025-02-01 4 0 0 0 0 0 66.18859318
2022-09-01 2022-09-01 4 49.42 254.86 191.12 248.80 206.53 98.22
2025-01-01 2025-01-01 4 0 0 0 0 14.47 54.24
2022-06-01 2022-06-01 4 0 0 50.25 53.26 51.65 15.30
2025-02-01 2025-02-01 4 0 0 0 0 0 28.67
2022-09-01 2022-09-01 4 148.20 758.22 535.45 676.73 545.42 251.83
2025-01-01 2025-01-01 4 0 0 0 0 38.23 139.07
2022-06-01 2022-06-01 4 0 0 140.78 144.88 136.41 39.23
You can figure out how much to shift per year, then subtract it from the current year and add it to the next year.
Get the column names of interest
ycols = [str(n) for n in range(2020,2026)]
calculate the amount that needs shifting, per year (to the next year):
shift_df = df[ycols].multiply(df['Time_Shift']/365.0, axis=0)
looks like this
2020 2021 2022 2023 2024 2025
-- -------- ------- -------- -------- -------- --------
0 0 0 1.27101 1.34718 1.30652 0.386959
1 0 0 0 0 0 0.725354
2 0.541589 2.79299 2.09447 2.72658 2.26334 1.07638
3 0 0 0 0 0.158575 0.594411
4 0 0 0.550685 0.583671 0.566027 0.167671
5 0 0 0 0 0 0.314192
6 1.62411 8.30926 5.86795 7.41622 5.97721 2.75978
7 0 0 0 0 0.418959 1.52405
8 0 0 1.54279 1.58773 1.4949 0.429918
Now create a copy of df (could use the original if you want of course) and apply the operations:
df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]
df2[ycols[1:]] =df2[ycols[1:]] + shift_df[ycols[:-1]].values
The slightly tricky bits are in the last line: the indexing [1:] and [:-1] lines each year up with the previous year's shift, and the .values method strips the column labels, since otherwise pandas would try to align on mismatched labels and the addition would not work as intended.
After this we get df2:
Launch_Date Launch_Date_Base Time_Shift 2020 2021 2022 2023 2024 2025
-- ------------- ------------------ ------------ -------- ------- -------- ------- -------- --------
0 2022-06-01 2022-06-01 4 0 0 114.709 122.854 119.261 36.2296
1 2025-02-01 2025-02-01 4 0 0 0 0 0 65.4632
2 2022-09-01 2022-09-01 4 48.8784 252.609 191.819 248.168 206.993 99.407
3 2025-01-01 2025-01-01 4 0 0 0 0 14.3114 53.8042
4 2022-06-01 2022-06-01 4 0 0 49.6993 53.227 51.6676 15.6984
5 2025-02-01 2025-02-01 4 0 0 0 0 0 28.3558
6 2022-09-01 2022-09-01 4 146.576 751.535 537.891 675.182 546.859 255.047
7 2025-01-01 2025-01-01 4 0 0 0 0 37.811 137.965
8 2022-06-01 2022-06-01 4 0 0 139.237 144.835 136.503 40.295
As you noticed, the amount shifted out of year 2025 is 'lost', i.e. we do not assign it to any new 2026 column.
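Putting the steps above together as one runnable sketch; the two-row example frame and the underscored column name are my own, standing in for the question's data:

```python
import pandas as pd

# Toy stand-in for the question's data: yearly revenue plus a shift in days
df = pd.DataFrame({'Time_Shift': [4, 4],
                   '2020': [0.0, 100.0],
                   '2021': [365.0, 730.0]})
ycols = ['2020', '2021']

# Fraction of each year's revenue that moves into the following year
shift_df = df[ycols].multiply(df['Time_Shift'] / 365.0, axis=0)

df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]
# .values strips the labels so 2020's shifted amount lands in 2021
df2[ycols[1:]] = df2[ycols[1:]] + shift_df[ycols[:-1]].values
```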

Pandas new column as result of sum in date range

having this dataframe:
provincia contagios defunciones fecha
0 distrito nacional 11 0 18/3/2020
1 azua 0 0 18/3/2020
2 baoruco 0 0 18/3/2020
3 dajabon 0 0 18/3/2020
4 barahona 0 0 18/3/2020
How can I have a new dataframe like this:
provincia contagios_from_march1_8 defunciones_from_march1_8
0 distrito nacional 11 0
1 azua 0 0
2 baoruco 0 0
3 dajabon 0 0
4 barahona 0 0
Where the 'contagios_from_march1_8' and 'defunciones_from_march1_8' are the result of the sum of the 'contagios' and 'defunciones' in the date range 3/1/2020 to 3/8/2020.
Thanks.
You can sum on a boolean condition over the date column. E.g., after converting 'fecha' to datetime:
df[(df['fecha'] >= '2020-03-01') & (df['fecha'] <= '2020-03-08')]['contagios'].sum()
Refer to this for extracting month and year out of a date: Extracting just Month and Year separately from Pandas Datetime column
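A fuller sketch of the per-province sums the question asks for; the sample rows and the dayfirst date parsing are assumptions on my part:

```python
import pandas as pd

df = pd.DataFrame({'provincia': ['azua', 'azua', 'baoruco'],
                   'contagios': [2, 3, 5],
                   'defunciones': [0, 1, 0],
                   'fecha': ['2/3/2020', '7/3/2020', '12/3/2020']})  # day/month/year
df['fecha'] = pd.to_datetime(df['fecha'], dayfirst=True)

# Keep only rows inside the range, then sum per province
mask = df['fecha'].between('2020-03-01', '2020-03-08')
out = (df[mask]
       .groupby('provincia')[['contagios', 'defunciones']]
       .sum()
       .add_suffix('_from_march1_8')
       .reset_index())
```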

Using Pandas to build a 2D table based on COUNTIF()'s of a separate excel sheet

I would like to build a 2D table based on values (and countifs) from another table. I managed to prototype this successfully using Excel, however I am stuck with two concepts:
1. Emulating Excel COUNTIF() on pandas
2. Dynamically build a new dataframe
Note: COUNTIF() takes a range and a criterion as arguments. For example, if I have a list of colors and I would like to know the number of times 'Orange' is in the list below:
A
Red
Orange
Blue
Orange
Black
, then I would simply use the following formula:
COUNTIF(A1:A5, "Orange")
This should return 2.
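In pandas, a single-criterion COUNTIF is just the sum of a boolean mask, since True counts as 1; a sketch using the color list above:

```python
import pandas as pd

colors = pd.Series(['Red', 'Orange', 'Blue', 'Orange', 'Black'], name='A')
count = (colors == 'Orange').sum()  # number of rows equal to 'Orange'
```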
Of course these functions can become more complex. For example, concatenating criteria in the form COUNTIFS(range1, criterion1, range2, criterion2, ...) is interpreted as an AND of the criteria. For example, if I want to count females over 35 in a list similar to the one below:
A B
Female 19
Female 40
Male 45
, then I would simply use the following formula:
COUNTIFS(A1:A3, "Female", B1:B3, ">35")
This should return 1.
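The multi-criteria case translates the same way, combining the masks with `&`; a sketch with the table above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Female', 'Female', 'Male'],
                   'B': [19, 40, 45]})
count = ((df['A'] == 'Female') & (df['B'] > 35)).sum()  # AND of both criteria
```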
Back to my use case. This is the source table:
Product No Opening Date Closing Date Opening Month Closing Month
0 1 2016-01-01 2016-06-30 2016-01-31 2016-06-30
1 2 2016-01-01 2016-04-30 2016-01-31 2016-04-30
2 3 2016-02-01 2016-06-30 2016-02-29 2016-06-30
3 4 2016-02-01 2016-05-31 2016-02-29 2016-05-31
4 5 2016-02-01 2099-12-31 2016-02-29 2099-12-31
5 6 2016-01-01 2099-12-31 2016-01-31 2016-10-31
6 7 2016-06-01 2016-07-31 2016-06-30 2016-07-31
7 8 2016-06-01 2016-11-30 2016-06-30 2016-11-30
8 9 2016-06-01 2016-07-31 2016-06-30 2016-07-31
9 10 2016-06-01 2099-12-31 2016-06-30 2099-12-31
And this is the 2d matrix that I want to achieve:
2016-01-31 2016-02-29 2016-03-31 2016-04-30 2016-05-31 \
2016-01-31 3 3 3 2 2
2016-02-29 3 3 3 3 2
2016-03-31 0 0 0 0 0
2016-04-30 0 0 0 0 0
2016-05-31 0 0 0 0 0
2016-06-30 4 4 4 4 4
2016-07-31 0 0 0 0 0
2016-08-31 0 0 0 0 0
2016-09-30 0 0 0 0 0
2016-10-31 0 0 0 0 0
2016-11-30 0 0 0 0 0
2016-12-31 0 0 0 0 0
2016-06-30 2016-07-31 2016-08-31 2016-09-30 2016-10-31 \
2016-01-31 1 1 1 1 0
2016-02-29 1 1 1 1 1
2016-03-31 0 0 0 0 0
2016-04-30 0 0 0 0 0
2016-05-31 0 0 0 0 0
2016-06-30 4 2 2 2 2
2016-07-31 0 0 0 0 0
2016-08-31 0 0 0 0 0
2016-09-30 0 0 0 0 0
2016-10-31 0 0 0 0 0
2016-11-30 0 0 0 0 0
2016-12-31 0 0 0 0 0
2016-11-30 2016-12-31
2016-01-31 0 0
2016-02-29 1 1
2016-03-31 0 0
2016-04-30 0 0
2016-05-31 0 0
2016-06-30 1 1
2016-07-31 0 0
2016-08-31 0 0
2016-09-30 0 0
2016-10-31 0 0
2016-11-30 0 0
2016-12-31 0 0
Basically I want to build a matrix of product survival through time. The vertical axis holds the origination of new products whereas the horizontal axis measures how much of these accounts persist through time.
For example if 10 products were launched in January, the figure for January vs January should be 10. If 1 of these 10 products was closed in February, the figure for January vs February should be 9. If all the remaining products were closed by June, then the rows January vs June, July, August, etc should be 0.
Product development in February, March, April, etc.. will not affect the January row.
I managed to build the 2d matrix using the following excel formula:
=COUNTIF(Accounts!$D$2:$D$11,Main!$A2)-COUNTIFS(Accounts!$D$2:$D$11,Main!$A2, Accounts!$E$2:$E$11,"<="&Main!B$1)
(this will populate the first cell)
My initial strategy was to build a multi-dimensional list and using a number of for-loops to populate them, but I am not sure whether there's an easier (or more recommended way) in Pandas.
Since I don't have enough reputation to comment on your question just yet, I'm going to assume that you have typos in your data where the year is equal to 2099.
I would also like to ask how in your 2016-06-30 row there are 4 'Product No' that somehow existed in the first few columns (i.e. 2016-01-31 to 2016-05-31).
If those are errors then here is my solution:
First, make the data:
# Make dataframe
import pandas as pd

df = pd.DataFrame({'Product No': [i for i in range(1, 11)],
                   'Opening Date': ['2016-01-01']*2 +
                                   ['2016-02-01']*3 +
                                   ['2016-01-01'] +
                                   ['2016-06-01']*4,
                   'Closing Date': ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31'] +
                                   ['2016-12-31']*2 +
                                   ['2016-07-31', '2016-11-30', '2016-07-31', '2016-12-31'],
                   'Opening Month': ['2016-01-31']*2 +
                                    ['2016-02-29']*3 +
                                    ['2016-01-31'] +
                                    ['2016-06-30']*4,
                   'Closing Month': ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31',
                                     '2016-12-31', '2016-10-31', '2016-07-31', '2016-11-30',
                                     '2016-07-31', '2016-12-31']})
# Reorder columns
df = df.loc[:, ['Product No', 'Opening Date', 'Closing Date',
                'Opening Month', 'Closing Month']]
# Convert dates to datetime
for i in df.columns[1:]:
    df.loc[:, i] = pd.to_datetime(df.loc[:, i])
Second, I created a 'daterange' dataframe for holding the min to max dates of the original data set. I also included a 'Product No' column so that each Product would have a row on the table:
# Create date range dataframe
daterange = pd.DataFrame({'daterange': pd.date_range(start=df.loc[:, 'Opening Month'].min(),
                                                     end=df.loc[:, 'Closing Month'].max(),
                                                     freq='M'),
                          'Product No': [1]*12})
# Create 10 multiples of the daterange and concatenate
daterange10 = pd.concat([daterange]*10)
# Find the cumulative sum of the 'Product No' for daterange10
daterange10.loc[:, 'Product No'] = daterange10.groupby('daterange').cumsum()
Third, I merge the daterange and original df together and limit rows to only include when a 'Product No' existed. Also note that I have it so the closed dates must be greater than or equal to the daterange since (in my opinion) if the product closed on the last day of the month, then it existed during that whole month:
# Merge df with daterange10
df = df.merge(daterange10,
              how='inner',
              on='Product No')
# Limit rows to when 'Opening Month' is <= 'daterange' and 'Closing Month' is >= 'daterange'
df = df[(df.loc[:, 'Opening Month'] <= df.loc[:, 'daterange']) &
        (df.loc[:, 'Closing Month'] >= df.loc[:, 'daterange'])]
Last, I make a pivot table with the date values. Note that it only includes dates on the vertical axis that existed in the first place:
# Pivot on 'Opening Month', 'daterange'; count unique 'Product No'; fill NA with 0
df.pivot_table(index='Opening Month',
               columns='daterange',
               values='Product No',
               aggfunc=pd.Series.nunique).fillna(0)
Try putting your data into a pandas DataFrame, then using an iterative approach to build the product survival DataFrame:
import pandas as pd
mydata = pd.read_excel('mysourcedata.xlsx')
def product_survival(sourcedf, startdate, enddate):
    df = pd.DataFrame()
    daterange = pd.date_range(startdate, enddate, freq='M')
    for i in daterange:      # Rows
        for j in daterange:  # Columns
            mycount = sourcedf[(sourcedf['Opening Month'] == i) &
                               (sourcedf['Closing Month'] > j)]['Product No'].count()
            df.loc[i, j] = mycount
    return df
print(product_survival(mydata, '2016-01-31', '2016-12-31'))

PYTHON: Pandas datetime index range to change column values

I have a dataframe indexed using a 12hr frequency datetime:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 0
2007-09-28 12:00:00 NaN NaN 0
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28'; for each listed day I wish to update all of its 'ls' values from 0 to 1, like so:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 1
2007-09-28 12:00:00 NaN NaN 1
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done using another column variable, e.g.:
data.loc[data.id == 1, 'ls'] = 1
yet this does not work with a datetime index. Could you let me know what the method for a datetime index is?
You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
then you can modify your df using .loc (avoiding chained assignment, which may silently assign to a copy):
df.loc[pd.DatetimeIndex(df.index.date).isin(pd.to_datetime(days)), 'ls'] = 1
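A runnable sketch of the whole pattern, with toy rows standing in for the question's frame:

```python
import pandas as pd
import numpy as np

# 12-hourly index spanning three calendar days
idx = pd.to_datetime(['2007-09-27 00:00', '2007-09-27 12:00',
                      '2007-09-28 00:00', '2007-09-28 12:00',
                      '2007-09-29 00:00'])
df = pd.DataFrame({'id': [1, 1, 1, np.nan, np.nan]}, index=idx)
df['ls'] = 0  # binary default

days = ['2007-09-28']
# Compare on the calendar day, ignoring the 00:00/12:00 time component
mask = pd.DatetimeIndex(df.index.date).isin(pd.to_datetime(days))
df.loc[mask, 'ls'] = 1
```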
