I have a binary string like this:
0001111000011111111111110001011011000000000011111100000111110
I want to map each digit to a date, starting from 01/10/2021 and ending on 30/11/2021, knowing that each digit in the string corresponds to one day.
The value 1 represents a day out and the value 0 represents a day at home.
So output:
Day         Code
01/10/2021  0
02/10/2021  0
03/10/2021  0
04/10/2021  1
...
30/11/2021  0
How can I do this? Thanks for any help!
Build your dataframe like this:
import pandas as pd

code = '0001111000011111111111110001011011000000000011111100000111110'
start_date = '2021-10-01'
df = pd.DataFrame({'Day': pd.date_range(start_date, periods=len(code), freq='D'),
                   'Code': list(code)})
Output:
>>> df
Day Code
0 2021-10-01 0
1 2021-10-02 0
2 2021-10-03 0
3 2021-10-04 1
4 2021-10-05 1
.. ... ...
56 2021-11-26 1
57 2021-11-27 1
58 2021-11-28 1
59 2021-11-29 1
60 2021-11-30 0
[61 rows x 2 columns]
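Note that list(code) gives you one-character strings, so Code is an object column. A small follow-up sketch, assuming you want it numeric (e.g. to total the days out):
df['Code'] = df['Code'].astype(int)
df['Code'].sum()  # number of days out, since 1 == day out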
Given the following DataFrame in Python pandas:
date        time_SEL  time_02_SEL_01  time_03_SEL_05  other
2022-01-01  34756     233232          3432423         756
2022-01-03  23322     4343            3334            343
2022-02-01  123232    3242            23423           434
2022-03-01  7323232   32423           323423          34324
All columns other than date represent a duration in seconds. My idea is to convert these values to Timedelta, but I only want to apply the change to columns whose name contains the string "_SEL".
Naturally I want to apply this by matching the string, because the original dataset will have more than 3 columns containing it. If there were only 3, I would know how to do it manually.
You can apply pandas.to_timedelta on all columns selected by filter and update the original dataframe:
df.update(df.filter(like='_SEL').apply(pd.to_timedelta, unit='s'))
NB: there is no output; the modification is done in place.
updated dataframe:
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
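If you prefer not to mutate df, a non-mutating sketch of the same conversion (assign unpacks the converted columns into a new frame, leaving df untouched):
out = df.assign(**df.filter(like='_SEL').apply(pd.to_timedelta, unit='s'))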
update "TypeError: invalid type promotion"
ensure you have numbers:
(df.update(df.filter(like='_SEL')
.apply(lambda c: pd.to_timedelta(pd.to_numeric(c, errors='coerce'),
unit='s'))
)
Use DataFrame.filter to get all columns ending with _SEL, convert them to timedeltas with to_timedelta, and replace the originals with DataFrame.update:
df.update(df.filter(regex='_SEL$').apply(lambda x: pd.to_timedelta(x, unit='s')))
print(df)
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
Another idea is to filter the columns with Series.str.endswith:
m = df.columns.str.endswith('_SEL')
df.loc[:, m] = df.loc[:, m].apply(lambda x: pd.to_timedelta(x, unit='s'))
print(df)
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
EDIT: To convert the column values to integers, use .astype(int):
df.update(df.filter(regex='_SEL$').astype(int).apply(lambda x: pd.to_timedelta(x, unit='s')))
If that fails because of some non-numeric values, use:
df.update(df.filter(regex='_SEL$').apply(lambda x: pd.to_timedelta(pd.to_numeric(x, errors='coerce'), unit='s')))
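For reference, a self-contained sketch tying the above together (sample values from the question, column names as printed in the answers; plain column assignment is used instead of update, which sidesteps the dtype-promotion issue entirely):
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-03', '2022-02-01', '2022-03-01'],
    'time_SEL': [34756, 23322, 123232, 7323232],
    'time_02_SEL': [233232, 4343, 3242, 32423],
    'time_03_SEL': [3432423, 3334, 23423, 323423],
    'other': [756, 343, 434, 34324],
})

# select the *_SEL columns, convert seconds to Timedelta, write them back
cols = df.filter(like='_SEL').columns
df[cols] = df[cols].apply(pd.to_timedelta, unit='s')
print(df)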
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[image: excerpt of the DataFrame showing the StateHoliday column]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
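A hedged vectorized alternative to the loop above, which also builds the "b-" style labels the question mentions (helper names are made up; shifting dates with Timedelta replaces the explicit look-back loop):
# holiday rows only
hol = df.loc[df["StateHoliday"] != "0", ["Date", "StateHoliday"]]
# offset 0 keeps the holiday code itself; earlier days get a '-' suffix
frames = [hol.assign(Date=hol["Date"] - pd.Timedelta(days=i),
                     StateHolidayNew=hol["StateHoliday"] + ("-" if i else ""))
          for i in range(0, 3)]
lookup = pd.concat(frames).drop_duplicates("Date")
out = df.merge(lookup[["Date", "StateHolidayNew"]], on="Date", how="left")
out["StateHolidayNew"] = out["StateHolidayNew"].fillna("0")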
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to back-fill the group numbers with the holiday letter that follows them, e.g. groups=0.0 would become b_0, which would make it clearer which holiday each group precedes; I am not sure how to do that, but a sketch of the idea follows the final DataFrame below.
import numpy as np

# flag holiday rows (letters), then number the runs of rows between holidays
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
# holiday rows keep their letter, the rest get their run number
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
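A hedged sketch of the back-fill idea mentioned above: tag each numeric group with the holiday letter that follows it (assumes df still carries the groups column built above, in display order):
# letters only on holiday rows, back-filled onto the preceding group
is_holiday = df['StateHoliday'].str.isalpha().fillna(False)
letters = df['StateHoliday'].where(is_holiday).bfill()
df['groups'] = np.where(is_holiday, df['groups'],
                        letters.fillna('none') + '_' + df['groups'].astype(str))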
I have a DataFrame df, that, once sorted by date, looks like this:
User Date price
0 2 2020-01-30 50
1 1 2020-02-02 30
2 2 2020-02-28 50
3 2 2020-04-30 10
4 1 2020-12-28 10
5 1 2020-12-30 20
I want to compute, for each row:
the number of rows in the last month, and
the sum of price in the last month.
For the example above, this is the output I'm looking for:
User Date price NumlastMonth Totallastmonth
0 2 2020-01-30 50 0 0
1 1 2020-02-02 30 0 0 # not 1, 50 ???
2 2 2020-02-28 50 1 50
3 2 2020-04-30 10 0 0
4 1 2020-12-28 10 0 0
5 1 2020-12-30 20 1 10 # not 0, 0 ???
I tried this, but the result accumulates over all previous rows, not only over the last month:
df['NumlastMonth'] = df.sort_values('Date')\
                       .groupby('User').price.cumcount()
df['Totallastmonth'] = df.sort_values('Date')\
                         .groupby('User').price.cumsum()
Taking the question literally (and acknowledging that the example doesn't quite match the description), we could do:
tally = df.groupby(pd.Grouper(key='Date', freq='M')).agg({'User': 'count', 'price': sum})
tally.index += pd.offsets.Day(1)
tally = tally.reindex(index=df.Date, method='ffill', fill_value=0)
On your input, that gives:
>>> tally
User price
Date
2020-01-30 0 0
2020-02-02 1 50
2020-02-28 1 50
2020-04-30 0 0
2020-12-28 0 0
2020-12-30 0 0
After that, it's easy to change the column names and concat:
df2 = pd.concat([
df.set_index('Date'),
tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'})
], axis=1)
# out:
User price NumlastMonth Totallastmonth
Date
2020-01-30 2 50 0 0
2020-02-02 1 30 1 50
2020-02-28 2 50 1 50
2020-04-30 2 10 0 0
2020-12-28 1 10 0 0
2020-12-30 1 20 0 0
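Note that the expected output in the question actually looks like a per-user trailing 30-day window rather than the previous calendar month. A hedged sketch of that reading, using a time-based rolling window (closed='left' excludes the current row from its own window):
rolled = (df.sort_values('Date')
            .set_index('Date')
            .groupby('User')['price']
            .rolling('30D', closed='left')
            .agg(['count', 'sum'])
            .fillna(0)
            .rename(columns={'count': 'NumlastMonth', 'sum': 'Totallastmonth'}))
df2 = df.merge(rolled, left_on=['User', 'Date'], right_index=True)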
So I have a dataframe (df) with dated data on a monthly time series (end of the month). It looks something like this:
Date Data
2010-01-31 625000
2010-02-28 750000
...
2014-10-31 450000
2014-11-30 475000
I would like to check on seasonal monthly effects.
This is probably simple to do, but how can I go about extracting the month from Date to create categorical dummy variables for use in a regression?
I want it to look something like this:
Date        01 02 03 04 05 06 07 08 09 10 11
2010-01-31   1  0  0  0  0  0  0  0  0  0  0
2010-02-28   0  1  0  0  0  0  0  0  0  0  0
...
2014-10-31   0  0  0  0  0  0  0  0  0  1  0
2014-11-30   0  0  0  0  0  0  0  0  0  0  1
I tried using pd.DataFrame(df.index.month, index=df.index)... which gives me the month for each date. I believe I need to use pandas.core.reshape.get_dummies to then get the variables in a 0/1 matrix format. Can someone show me how? Thanks.
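A minimal sketch of the pd.get_dummies route (toy data; assumes df is indexed by the month-end dates, as in the question):
import pandas as pd

dates = pd.date_range('2010-01-31', periods=4, freq='M')
df = pd.DataFrame({'Data': [625000, 750000, 680000, 450000]}, index=dates)

# one column per month number, 1 where the row's month matches
dummies = pd.get_dummies(df.index.month, dtype=int).set_axis(df.index)
result = df.join(dummies)
For a regression you may also want drop_first=True to avoid the dummy-variable trap.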
This is how I got April:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=4, freq='MS')
df = pd.DataFrame(np.random.randn(4), index=dates, columns=['data'])
df.loc[dates.month==4]
The idea is to make the dates your index and then do boolean index selection on the dataframe.
>>> df
data
2013-01-01 0.141205
2013-02-01 0.115361
2013-03-01 -0.309521
2013-04-01 -0.236317
>>> df.loc[dates.month==4]
data
2013-04-01 -0.236317
I have a DataFrame with events. One or more events can occur on a given date (so the date can't be an index). The date range spans several years. I want to group by year and month and get a count of the Category values. Thanks!
In [12]: df = pd.read_excel('Pandas_Test.xls', 'sheet1')
In [13]: df
Out[13]:
EventRefNr DateOccurence Type Category
0 86596 2010-01-02 00:00:00 3 Small
1 86779 2010-01-09 00:00:00 13 Medium
2 86780 2010-02-10 00:00:00 6 Small
3 86781 2010-02-09 00:00:00 17 Small
4 86898 2010-02-10 00:00:00 6 Small
5 86898 2010-02-11 00:00:00 6 Small
6 86902 2010-02-17 00:00:00 9 Small
7 86908 2010-02-19 00:00:00 3 Medium
8 86908 2010-03-05 00:00:00 3 Medium
9 86909 2010-03-06 00:00:00 8 Small
10 86930 2010-03-12 00:00:00 29 Small
11 86934 2010-03-16 00:00:00 9 Small
12 86940 2010-04-08 00:00:00 9 High
13 86941 2010-04-09 00:00:00 17 Small
14 86946 2010-04-14 00:00:00 10 Small
15 86950 2011-01-19 00:00:00 12 Small
16 86956 2011-01-24 00:00:00 13 Small
17 86959 2011-01-27 00:00:00 17 Small
I tried:
df.groupby(df['DateOccurence'])
For the month and year break out I often add additional columns to the data frame that break out the dates into each piece:
df['year'] = [t.year for t in df.DateOccurence]
df['month'] = [t.month for t in df.DateOccurence]
df['day'] = [t.day for t in df.DateOccurence]
It adds space complexity (extra columns on the df) but is less time complex (less processing on the groupby) than a datetime index; it's really up to you. A datetime index is the more pandas-idiomatic way to do things.
After breaking out by year, month, day you can do any groupby you need.
df.groupby(['year', 'month']).Category.apply(pd.value_counts)
To get months across multiple years:
df.groupby('month').Category.apply(pd.value_counts)
Or with Andy Hayden's datetime index:
df.groupby(di.month).Category.apply(pd.value_counts)
You can simply pick which method fits your needs better.
You can apply value_counts to the SeriesGroupBy (for the column):
In [11]: g = df.groupby('DateOccurence')
In [12]: g.Category.apply(pd.value_counts)
Out[12]:
DateOccurence
2010-01-02 Small 1
2010-01-09 Medium 1
2010-02-09 Small 1
2010-02-10 Small 2
2010-02-11 Small 1
2010-02-17 Small 1
2010-02-19 Medium 1
2010-03-05 Medium 1
2010-03-06 Small 1
2010-03-12 Small 1
2010-03-16 Small 1
2010-04-08 High 1
2010-04-09 Small 1
2010-04-14 Small 1
2011-01-19 Small 1
2011-01-24 Small 1
2011-01-27 Small 1
dtype: int64
I had actually hoped this would return the following DataFrame directly, but you need to unstack it:
In [13]: g.Category.apply(pd.value_counts).unstack(-1).fillna(0)
Out[13]:
High Medium Small
DateOccurence
2010-01-02 0 0 1
2010-01-09 0 1 0
2010-02-09 0 0 1
2010-02-10 0 0 2
2010-02-11 0 0 1
2010-02-17 0 0 1
2010-02-19 0 1 0
2010-03-05 0 1 0
2010-03-06 0 0 1
2010-03-12 0 0 1
2010-03-16 0 0 1
2010-04-08 1 0 0
2010-04-09 0 0 1
2010-04-14 0 0 1
2011-01-19 0 0 1
2011-01-24 0 0 1
2011-01-27 0 0 1
If there were multiple different Categories with the same Date they would be on the same row...
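Since the question asked for counts per year and month rather than per exact date, a hedged follow-up on top of the same idea (assumes DateOccurence is already datetime64):
dts = df['DateOccurence']
df.groupby([dts.dt.year, dts.dt.month]).Category.apply(pd.value_counts).unstack(-1).fillna(0)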