Assume that I have the following data set
import pandas as pd, numpy, datetime
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 12, 31)
date_list = pd.date_range(start, end, freq='B')
numdays = len(date_list)
value = numpy.random.normal(loc=1e3, scale=50, size=numdays)
ids = numpy.repeat([1], numdays)
test_df = pd.DataFrame({'Id': ids,
                        'Date': date_list,
                        'Value': value})
I would now like to calculate the maximum within each business quarter for test_df. One possibility is to use resample with rule='BQ', how='max'. However, I'd like to keep the structure of the array and just generate another column with the maximum for each BQ. Have you guys got any suggestions on how to do this?
I think the following should work for you; this groups on the quarter and calls transform on the 'Value' column, returning the maximum value as a Series with its index aligned to the original df:
In [26]:
test_df['max'] = test_df.groupby(test_df['Date'].dt.quarter)['Value'].transform('max')
test_df
Out[26]:
Date Id Value max
0 2015-01-01 1 1005.498555 1100.197059
1 2015-01-02 1 1032.235987 1100.197059
2 2015-01-05 1 986.906171 1100.197059
3 2015-01-06 1 984.473338 1100.197059
........
256 2015-12-25 1 997.965285 1145.215837
257 2015-12-28 1 929.652812 1145.215837
258 2015-12-29 1 1086.128017 1145.215837
259 2015-12-30 1 921.663949 1145.215837
260 2015-12-31 1 938.189566 1145.215837
[261 rows x 4 columns]
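Note that grouping on dt.quarter alone would lump the same quarter from different years into one group; it only works here because the data covers a single year. A sketch of a year-safe variant (same idea, just grouping on a quarterly period instead):

# Group on a (year, quarter) period so e.g. 2015Q1 and 2016Q1 stay separate
test_df['max'] = test_df.groupby(test_df['Date'].dt.to_period('Q'))['Value'].transform('max')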
I have some data I want to count by month. The column I want to count has three different possible values, each representing a different car sold. Here is an example of my dataframe:
Date Type_Car_Sold
2015-01-01 00:00:00 2
2015-01-01 00:00:00 1
2015-01-01 00:00:00 1
2015-01-01 00:00:00 3
... ...
I want to make it so I have a dataframe that counts each specific car type sold by month separately, so looking like this:
Month Car_Type_1 Car_Type_2 Car_Type_3 Total_Cars_Sold
1 15 12 17 44
2 9 18 20 47
... ... ... ... ...
How exactly would I go about doing this? I've tried doing:
cars_sold = car_data['Type_Car_Sold'].groupby(car_data.Date.dt.month).agg('count')
but that just sums up all the cars sold in the month, rather than breaking it down by the total amount of each type sold. Any thoughts?
Maybe not the cleanest solution, but this should get you pretty close:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df['Value'] = 1
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count'))
Type 1 2
Date
2022-01 1.0 1.0
2022-02 2.0 NaN
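If the NaN for a month with no sales of a given type is unwanted, pivot_table accepts a fill_value argument, so the same call with fill_value=0 should produce zeros instead:

print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count', fill_value=0))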
Alternatively you can also pass multiple columns to groupby:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df.groupby(['Date', 'Type']).size()
Date Type
2022-01 1 1
2 1
2022-02 1 2
dtype: int64
This has the unfortunate side effect of excluding keys with zero counts, and the result has a MultiIndex on the rows rather than spreading the types across columns; see the sketch below for a way around both.
For more information on this approach, check this question.
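To get the wide layout from the question, including zeros and a totals column, one sketch (reusing df from the snippet above) is to unstack the groupby result:

counts = df.groupby(['Date', 'Type']).size().unstack(fill_value=0)
counts.columns = ['Car_Type_' + str(c) for c in counts.columns]
counts['Total_Cars_Sold'] = counts.sum(axis=1)
print(counts)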
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a data frame df which, for now, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
pre_df_list = []
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] <= date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge in the previous results. The fillna(method='ffill') will fill gaps in the middle of the data, while the final fillna(0) handles any gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
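A sketch of an equivalent without the merge, assuming date_list is sorted: set Date as the index and reindex directly onto the full date range, forward-filling as you go.

merged_df = (event_counts.set_index('Date')
             .reindex(date_list, method='ffill')
             .fillna(0)
             .rename_axis('Date')
             .reset_index())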
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> from datetime import date
>>> df = pd.DataFrame({'event_date': [date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
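If you only need the count rather than the rows themselves, the same comparison can be summed directly, since a boolean Series treats True as 1:

>>> (df.event_date > date(2020, 9, 1)).sum()
2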
I'm still a novice with Python and I'm having problems trying to group some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, for 'valor' it is not, since I need to obtain the corresponding complete record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but I can't get to what I want.
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
start_date = datetime.date(2020, 1, 1)  # assumed start date, matching the output below
day_count = 10                          # number of days, matching the output below
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1,1000) for _ in dates]
df = pd.DataFrame(zip(dates,values),columns=['dates','values'])
ie df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
ie this is the row with the latest date
&
df[df['values'] == df['values'].max()]
(Or, using idxmax again: df.loc[[df['values'].idxmax()]] - as in Scott Boston's answer.)
and
dates values
6 2020-01-07 915
ie this is the row with the highest value in the values column.
Reference.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo': pd.date_range('2018-07-01', periods=600, freq='d'),
                   'valor': np.random.random(600) + 3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
I want to perform a rolling median on the price column over 4 days back; the data will be grouped by date. So basically I want to take the prices for a given day and all prices for the 4 days back, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per index value, and given how the median is defined, I am not able to merge these rows into one result per row.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just deleted the first 3 values and then just printed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original one), sort_values per date, and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
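If you also want one value per original row rather than per date, a sketch mapping the per-date medians back onto the rows (assuming df is the original frame with string dates as above; the column name price_median_5d is just illustrative):

# one median per date, then map each row's date onto it
per_date = df_f.set_index('date')['price']
df['price_median_5d'] = pd.to_datetime(df['date']).map(per_date)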
This is a step-by-step process. There are probably more efficient methods of getting what you want. Note: if you have time information in your dates, you would need to drop it before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
I have a large data set with names of stores, dates and profits.
My data set is not the most organized but I now have it in this df.
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into one df by itself until I realized it would be very time consuming to do one for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " + str(2018)]
where n would be the day I want to get. So that function gets me just the dates I want.
What I really need is a way to put all the dates I want in a list and append the rows for each day to a df. I tried doing this but it did not work out:
def avg(n):
    dlist = []
    for i in n:
        dlist = df.loc[df['Date'] == "May" + " " + str(i) + " " + str(2018)]
        dlist = pd.DataFrame(dlist)
        dlist.append(i)
    return dlist
df2=avg([21,23,24,25])
My goal there was to have all the dates (21, 23, 24, 25) of May, each in its own df.
But it was a total fail; I got this error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure if it's also possible to add a rolling average or mean column for each day of (21, 23, 24, 25), but that's where the analysis will conclude.
output desired
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df which I can save to a csv file at the end.
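A direct fix for the failing function above is to select all the wanted days at once with isin instead of overwriting dlist on each pass (a sketch, assuming the Date values are strings exactly like 'May 21 2018'):

wanted = ['May ' + str(d) + ' 2018' for d in [21, 23, 24, 25]]
subset = df[df['Date'].isin(wanted)]
# or one df per day, keyed by day number
per_day = {d: df[df['Date'] == 'May ' + str(d) + ' 2018'] for d in [21, 23, 24, 25]}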
Rolling Mean:
The example data given in the question has data in the format of May 1 2018, which can't be used for rolling. Rolling requires a datetime index.
Instead of string splitting the original Date column, it should be converted to datetime, using df.Date = pd.to_datetime(df.Date), which will give dates in the format 2018-05-01
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get a Day and Month column, if desired.
Given the original data:
Original Data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average
In order to have a 30-day rolling mean, there needs to be at least 30 days of data for each store (i.e. the 30th day of data produces the first mean, covering the previous 30 days)
The following example will setup a dataframe consisting of every day in 2018, a random profit between 100 and 1001, and a random store, chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note: the first 30 days for each store will be NaN
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
    item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put it inside a dict.
This alternative makes it easier to plot a more detailed graph with less code.
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
df_dict = dict()
for store in df.Store.unique():
    df_dict[store] = df[['Date', 'Profit']][df.Store == store]
    df_dict[store].set_index('Date', inplace=True)
    df_dict[store]['Profit: 30-Day Rolling Mean'] = df_dict[store].rolling(30).mean()
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    v.reset_index(inplace=True)
    may_df[k] = v[v.Date.dt.month_name() == 'May']
    may_df[k].set_index('Date', inplace=True)
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
pandas.DataFrame.to_csv()
# may_df is a dict of dataframes, so save each store's frame to its own file
# (the filename pattern is just an example)
for store, frame in may_df.items():
    frame.reset_index(inplace=True)
    frame.to_csv(store + '_may.csv', index=False)
A simple solution may be groupby().
Check out this example:
import pandas as pd
listt = [['a',2,3],
         ['b',5,7],
         ['a',3,9],
         ['a',1,3],
         ['b',9,4],
         ['a',4,7],
         ['c',7,2],
         ['a',2,5],
         ['c',4,7],
         ['b',5,5]]
my_df = pd.DataFrame(listt)
my_df.columns=['Class','Day_1','Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: similarly, you can group your data by Date and get the average of your Profit.
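Applied to the question's data, that is a one-liner (a sketch, assuming df holds the Store/Date/Profit frame from the question):

# average profit per date; add 'Store' to the groupby list to split by store as well
df.groupby('Date')['Profit'].mean()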