Arrange value_counts by month - python

I have the following column in my dataframe
year-month
2020-01
2020-01
2020-01
2020-02
2020-02
...
2021-06
This column is stored as "object" dtype in my dataframe. I didn't convert it to "datetime" from the outset because then my values would change to "2020-01-01" instead(?)
Anyway, I wanted to do a value_counts() by month so that I can plot it out subsequently. How can I order the value_counts() by month while reflecting the month as "Jan", "Feb"..."Dec" at the same time?
I've tried this:
pd.to_datetime(df['year-month']).dt.month.value_counts().sort_index()
However, the months are reflected as "1", "2"..."12", which isn't what I want.
I then tried this:
pd.to_datetime(df['year-month']).dt.strftime('%b').value_counts().sort_index()
This gives me the months as "Jan", "Feb"..."Dec" indeed, but now they're sorted in alphabetical order instead of the actual month sequence.

Starting from this point of yours:
result = pd.to_datetime(df["year-month"]).dt.strftime("%b").value_counts()
we can reindex the result so that the index becomes the month name abbreviations in order. This can be borrowed from the calendar module:
import calendar
# slicing out the first since it is empty string
month_names = calendar.month_abbr[1:]
# reindex and put 0 to those that didn't appear at all
result = result.reindex(month_names, fill_value=0)
to get
>>> result
Jan 3
Feb 2
Mar 0
Apr 0
May 0
Jun 1
Jul 0
Aug 0
Sep 0
Oct 0
Nov 0
Dec 0
(The reason calendar.month_abbr has an empty string at the beginning is that Python is 0-indexed but we call the 2nd month February; putting an empty string in slot 0 makes month_abbr[2] == "Feb".)
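Putting the pieces together, here is a minimal end-to-end sketch, using a small made-up sample in place of your column:

```python
import calendar
import pandas as pd

# Hypothetical sample standing in for df["year-month"]
df = pd.DataFrame({"year-month": ["2020-01", "2020-01", "2020-01",
                                  "2020-02", "2020-02", "2021-06"]})

result = pd.to_datetime(df["year-month"]).dt.strftime("%b").value_counts()

# calendar.month_abbr[1:] is ["Jan", ..., "Dec"] in calendar order;
# reindexing forces that order and puts 0 for absent months
month_names = calendar.month_abbr[1:]
result = result.reindex(month_names, fill_value=0)
print(result)
```

The index is now in calendar order (Jan..Dec), ready to plot directly.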

How to filter dataframe based on condition that index is between date intervals?

I have 2 dataframes:
df_dec_light and df_rally.
df_dec_light.head():
log_return month year
1970-12-01 0.003092 12 1970
1970-12-02 0.011481 12 1970
1970-12-03 0.004736 12 1970
1970-12-04 0.006279 12 1970
1970-12-07 0.005351 12 1970
1970-12-08 -0.005239 12 1970
1970-12-09 0.000782 12 1970
1970-12-10 0.004235 12 1970
1970-12-11 0.003774 12 1970
1970-12-14 -0.005109 12 1970
df_rally.head():
rally_start rally_end
0 1970-12-18 1970-12-31
1 1971-12-17 1971-12-31
2 1972-12-15 1972-12-29
3 1973-12-21 1973-12-31
4 1974-12-20 1974-12-31
I need to filter df_dec_light based on the condition that df_dec_light.index is between the values of columns df_rally['rally_start'] and df_rally['rally_end'].
I've tried something like this:
df_dec_light[(df_dec_light.index >= df_rally['rally_start']) & (df_dec_light.index <= df_rally['rally_end'])]
I was expecting to receive a filtered df_dec_light dataframe with indexes that are within the intervals between df_rally['rally_start'] and df_rally['rally_end'].
Something like this:
log_return month year
1970-12-18 0.001997 12 1970
1970-12-21 -0.003108 12 1970
1970-12-22 0.001111 12 1970
1970-12-23 0.000666 12 1970
1970-12-24 0.005644 12 1970
1970-12-28 0.005283 12 1970
1970-12-29 0.010810 12 1970
1970-12-30 0.002061 12 1970
1970-12-31 -0.001301 12 1970
Would really appreciate any help. Thanks!
Let's create an IntervalIndex from the start and end column values in the df_rally dataframe, then map the intervals onto the index of df_dec_light and use notna to check whether each index value is contained in any interval:
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
then use the mask to filter the dataframe
df_dec_light[mask]
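For illustration, the same approach run end-to-end on toy frames shaped like the question's data (the log_return values here are made up):

```python
import pandas as pd

# Toy stand-ins for the question's frames
df_dec_light = pd.DataFrame(
    {"log_return": [0.003, 0.001, -0.002, 0.005]},
    index=pd.to_datetime(["1970-12-17", "1970-12-18", "1970-12-31", "1971-01-05"]),
)
df_rally = pd.DataFrame({
    "rally_start": pd.to_datetime(["1970-12-18"]),
    "rally_end": pd.to_datetime(["1970-12-31"]),
})

# Build intervals closed on both ends, then look up each date in them;
# dates falling in no interval map to NaN and are dropped by the mask
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end,
                                  closed="both")
mask = df_dec_light.index.map(ix.to_series()).notna()
filtered = df_dec_light[mask]
```

Note that this interval lookup assumes the rally windows do not overlap.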
To solve this we can first turn the ranges in df_rally into pd.DatetimeIndex objects by calling pd.date_range on each row. This gives us each row of df_rally as a pd.DatetimeIndex.
As we later want to check whether the index of df_dec_light is in any of the ranges, we combine all of them. This is done with union.
We assert that the newly created pd.Series index_list is not empty and then select its first element. This element is the pd.DatetimeIndex on which we call union with all the other pd.DatetimeIndex objects.
We can then use pd.Index.isin to create a boolean array indicating whether each index date is found in the passed set of dates.
Applying this mask to df_dec_light returns only the entries that fall within one of the ranges specified in df_rally.
index_list = df_rally.apply(lambda x: pd.date_range(x['rally_start'], x['rally_end']), axis=1)
assert not index_list.empty
all_ranges = index_list.iloc[0]
for rng in index_list.iloc[1:]:
    all_ranges = all_ranges.union(rng)
print(all_ranges)
mask = df_dec_light.index.isin(all_ranges)
print(df_dec_light[mask])

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100. I did it.
The second task is grouping id by consecutive days.
index date       id sales
0     01/01/2018 03 101
1     01/01/2018 07 178
2     02/01/2018 03 120
3     03/01/2018 03 150
4     05/01/2018 07 205
the result should be:
index id count
0     03 3
1     07 1
2     07 1
I need to do this task without using pandas/dataframe, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the suggestion for a solution here: count consecutive days python dataframe
but the ids are not grouped.
here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those consecutive-day counts in the "count" column, e.g. count of ids in the range of 0-7 days, 7-12 days etc. But that's not part of my question.
Thank you a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to turn the pandas Series back into a dataframe and name the new column count.
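For completeness, here is the whole pipeline run on the sample data copied from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["01/01/2018", "01/01/2018", "02/01/2018", "03/01/2018", "05/01/2018"],
    "id": ["03", "07", "03", "03", "07"],
    "sales": [101, 178, 120, 150, 205],
})

data = df[df["sales"] >= 100].copy()           # filter sales above 100
data["date"] = pd.to_datetime(data["date"], dayfirst=True)
df2 = data.sort_values(["id", "date"])

# Start a new group every time the gap to the previous date (per id) is not 1 day
s = df2.groupby("id").date.diff().dt.days.ne(1).cumsum()
new_frame = (df2.groupby(["id", s]).size()
                .reset_index(level=1, drop=True)
                .reset_index(name="count"))
print(new_frame)
```

The .copy() is added here only to avoid a SettingWithCopyWarning when assigning the parsed dates.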

TypeError: '_AtIndexer' object is not callable in pandas

I have a DataFrame object named df, and I want to generate a list of properly formatted dates. (datetime module is properly imported)
I wrote:
dates = [datetime.date(df.at(index, "year"), df.at(index, "month"), df.at(index, "day")) for index in df.index]
which gives the error in the title.
If it helps, this is the value of df.head():
year month day smoothed trend
0 2011 1 1 391.26 389.76
1 2011 1 2 391.29 389.77
2 2011 1 3 391.33 389.78
3 2011 1 4 391.36 389.78
4 2011 1 5 391.39 389.79
(This is new to me, so I have likely misinterpreted the docs)
df.at is not callable but a property that supports indexing, so change the parentheses around it to square brackets:
df.at[index, "year"]
i.e. replace ( with [ and ) with ].
Apart from using [ instead of (, you can achieve your goal simply with
pd.to_datetime(df[['year', 'month', 'day']])
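A minimal sketch of both variants, on a made-up frame shaped like df.head():

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "year": [2011, 2011], "month": [1, 1], "day": [1, 2],
    "smoothed": [391.26, 391.29], "trend": [389.76, 389.77],
})

# Fixed list comprehension: df.at[...] with square brackets.
# int() keeps datetime.date happy regardless of numpy integer dtypes.
dates = [datetime.date(int(df.at[i, "year"]),
                       int(df.at[i, "month"]),
                       int(df.at[i, "day"]))
         for i in df.index]

# Vectorised alternative: to_datetime on the year/month/day columns
dates2 = pd.to_datetime(df[["year", "month", "day"]])
```

The vectorised version returns a datetime64 Series rather than a list of datetime.date objects, which is usually what you want for further pandas work.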

If Current Date is in Column Then Show Row

I have a table in Excel where the column headers are the months and the rows are the days. I have already obtained today's date; now I need to match the month and day against the column "cy_day".
Example:
If today's date is Jan 3, it should return only "2".
Excel File:
cy_day jan feb mar
1 1 1 1
2 3 2 4
3 4 4 5
4 7 5 6
import pandas as pd
from pandas import DataFrame
import calendar
cycle_day_path = 'test\\Documents\\cycle_day_calendar.xlsx'
df = pd.read_excel(cycle_day_path)
df = DataFrame(df, index=None)
print(df)
month = pd.to_datetime('today').strftime("%b")
day = pd.to_datetime('today').strftime("%d")
Try this:
today = pd.Timestamp('2019-01-03')
col = today.strftime('%b').lower()
df[df[col] == today.day]
Given you've extracted month using '%b', it should just be this after correcting for the upper case in '%b' month name (see http://strftime.org/):
df.loc[df[month.lower()] == day, 'cy_day']
Now for Jan 3 you will get 2 (as a DataFrame). If you want just the number 2 do:
df.loc[df[month.lower()] == day, 'cy_day'].values[0]
The value of the month variable as returned by pd.to_datetime('today').strftime("%b") is a capitalized string, so in order to use it to access a column of your dataframe you should lowercase it.
So first you should do
month = month.lower()
After that, you need to make sure that the values in your month columns are of type str, since you are going to compare them with a str value:
day_of_month = df[month] == day
df["cy_day"][day_of_month]
If they are not of type str, you should instead convert the day variable to the same type as the month columns.
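To make the types line up cleanly, here is one workable sketch using the sample table from the question (read from a string instead of Excel, and with "today" hard-coded to Jan 3 as in the example):

```python
import io
import pandas as pd

# Sample table from the question
df = pd.read_csv(io.StringIO(
    "cy_day jan feb mar\n"
    "1 1 1 1\n"
    "2 3 2 4\n"
    "3 4 4 5\n"
    "4 7 5 6\n"), sep=r"\s+")

today = pd.Timestamp("2019-01-03")           # stand-in for pd.to_datetime('today')
col = today.strftime("%b").lower()           # '%b' gives 'Jan'; columns are lowercase
cy = df.loc[df[col] == today.day, "cy_day"]  # today.day is an int, matching the column
print(cy.values[0])
```

Using today.day (an int) instead of strftime("%d") (a zero-padded string) sidesteps the str-vs-int comparison issue entirely.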

Using 'isin' on a date in a pandas column

I have the following code that results in the following dataframe:
date=['1/1/2016','2/2/2017','4/8/2017','3/3/2015']
distance=['10','20','30','40']
dd=list(zip(date,distance))
df=pd.DataFrame(dd,columns=['date','distance'])
date distance
0 1/1/2016 10
1 2/2/2017 20
2 4/8/2017 30
3 3/3/2015 40
I would like to select all data for the year 2017. If I try the following I get an empty dataframe because it does not include the month and day also:
df=df[df['date'].isin(['2017'])]
Is there a way to accomplish this without splitting the date list into month,day, and year? If I have to split the date how would I be able to keep the corresponding distance?
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'].dt.year == 2017]
If all you want is to filter for '2017', you can do it on the strings directly:
df[df.date.str.endswith('2017')]
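Both approaches on the sample frame (note that str.endswith only works here because the year happens to be the last component of the date string):

```python
import pandas as pd

df = pd.DataFrame({"date": ["1/1/2016", "2/2/2017", "4/8/2017", "3/3/2015"],
                   "distance": ["10", "20", "30", "40"]})

# Datetime route: parse once, then filter on the year attribute
dt = pd.to_datetime(df["date"])
by_year = df[dt.dt.year == 2017]

# String route: filter on the raw strings directly
by_suffix = df[df["date"].str.endswith("2017")]
```

The datetime route is more robust if the format ever changes, and the parsed column can be reused for other date-based filtering.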
