I would like to compute the mean per ID using groupby and mean. However, I only need the rows where Date falls between 2016-01-01 and 2017-12-31.
import pandas as pd

d = {'ID': ['STCK123', 'STCK123', 'STCK123'], 'Amount': [250, 400, 350],
     'Date': ['2016-01-20', '2017-09-25', '2018-05-15']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Amount', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
This gives the following DataFrame:
ID Amount Date
STCK123 250 2016-01-20
STCK123 400 2017-09-25
STCK123 350 2018-05-15
When I use:
data.groupby(['ID'])['Amount'].agg('mean')
It takes all rows into account, resulting in a mean value of 333.3. How can I exclude the rows where the Date falls in 2018 (yielding a mean of (250+400)/2 = 325)?
You'll need a pre-filtering step with query (note: depending on your pandas version, the .dt accessor inside query may require passing engine='python'):
data.query('Date.dt.year != 2018').groupby('ID').mean()
Amount
ID
STCK123 325.0
More uses for eval, query, and associated parameters can be found in my writeup: Dynamic Expression Evaluation in pandas using pd.eval()
See here for more methods on dropping rows before calling groupby.
You can also mask those rows without having to drop them; NaNs are excluded from GroupBy aggregations.
data.mask(data.Date.dt.year == 2018).groupby('ID').mean()
Amount
ID
STCK123 325.0
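If you prefer plain boolean indexing, an equivalent sketch using Series.between to keep only the 2016-2017 rows (the date bounds are taken from the question):
# Keep only rows whose Date lies in the requested range, then aggregate.
mean_per_id = (data[data['Date'].between('2016-01-01', '2017-12-31')]
               .groupby('ID')['Amount']
               .mean())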
My dataframe has values of how many red cars are sold in a specific month. I have to build a predictive model to predict monthly sales.
I want the current data frame to be converted into the format below for time series modeling.
How can I read the column and row header to create a date column? I am hoping for a new data frame.
You can use melt() to transform the dataframe from the wide to the long format. Then we combine the YEAR and month information to build an actual date:
import pandas as pd

df = pd.DataFrame({'YEAR': [2021, 2022],
                   'JAN': [5, 232],
                   'FEB': [545, 48]})
df2 = df.melt(id_vars=['YEAR'], var_name='month', value_name='sales')
df2['date'] = df2.apply(lambda row: pd.to_datetime(str(row['YEAR']) + row['month'], format='%Y%b'), axis=1)
df2.sort_values('date')[['date', 'sales']]
This gives the output:
date sales
0 2021-01-01 5
2 2021-02-01 545
1 2022-01-01 232
3 2022-02-01 48
(for time series analysis you would probably want to set the date column as index)
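As a side note, the row-wise apply can be replaced by a vectorized string concatenation; a minimal sketch (same df2 as above), including setting the date index for time-series work:
# Build the dates in one vectorized call instead of apply.
df2['date'] = pd.to_datetime(df2['YEAR'].astype(str) + df2['month'], format='%Y%b')
ts = df2.sort_values('date').set_index('date')['sales']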
Consider this sample data created by this code:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
print(f'randomlist length is {len(randomlist)}')
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng, 'Val': np.random.randn(len(rng))})
The desired output is a groupby on id, summing all values, but only within a particular date range of the Date column. Even more complicated than that, I want the total Val by id restricted to the following window:
for each id, start one month after that id's earliest date, and end one year after that start date.
So, for example, if my data appeared this way:
id Date Val
0 684 2017-09-19 0.640472
1 684 2017-10-20 -0.732568
2 501 2017-08-21 -1.141365
3 501 2017-09-22 -0.283020
4 501 2017-09-23 0.725941
5 684 2017-09-24 0.56789
I would want the groupby to only consider the dates for id 684 between 2017-10-19 (i.e. one month after its earliest date) and 2018-10-19 (i.e. one year after that start date).
I have tried a straight groupby and Grouper to no avail; neither seems able to limit the aggregation by date. Perhaps I am missing something easy? Thanks for taking a look.
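One possible approach (a sketch, not from the original post): broadcast each id's window bounds with a groupby transform, then filter before grouping:
# Window start: one month after each id's earliest date; end: one year later.
starts = test.groupby('id')['Date'].transform('min') + pd.DateOffset(months=1)
ends = starts + pd.DateOffset(years=1)
in_window = test['Date'].between(starts, ends)
total_val = test[in_window].groupby('id')['Val'].sum()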
I'm a newbie to pandas.
I have a DataFrame created from data grabbed from my database, consisting of three columns: id, date, value (only one value for each pair of id and date).
What I want to do is divide the value column by a specific number (ratio) for each id in a specific date range. As my data is large (>10M records), I thought setting a MultiIndex on my DataFrame would be a good idea. And finally here's what I've done:
df = pd.DataFrame(raw_history, columns=['id', 'date', 'value'])
df = df.set_index(['id', 'date'])
for id in ids:
    ratio = calc_ratio(id)
    min_date = calc_min_date(id)
    history = df.loc[id]
    history.loc[history.index >= pd.to_datetime(min_date)] /= ratio
    df.loc[id] = history
What's the problem? It seems that I've misunderstood the concept of MultiIndex: df.loc[id] gets cleared after the last line. I mean, after the assignment, df.loc[id] returns an empty data frame.
So, what approach should I employ to get my column divided by ratio? I'm not sure whether a MultiIndex is a good idea for my data, but performance is important.
If I understood correctly what your dataframe looks like, then yes, a MultiIndex is a good idea. However, you don't need a for loop, which is usually a good thing in Python.
Your DataFrame should look something like this:
id date value
0 330 2020-03-30 03:00:00 180
1 330 2020-03-30 04:00:00 360
2 331 2020-03-30 05:00:00 120
3 331 2020-03-30 06:00:00 600
So this is what you can do:
import pandas as pd
import datetime
# Generate a sample DataFrame
ids = [330, 330, 331, 331]
df = pd.DataFrame({'id': ids,
                   'date': [datetime.datetime(2020, 3, 30, h) for h in range(3, 7)],
                   'value': [180, 360, 120, 600]})
# Set index inplace
df.set_index(['id', 'date'], inplace=True)
# Divide values by ratio only at rows where the condition "date >= min_date" is satisfied
min_date = datetime.datetime(2020, 3, 30, 5)
ratio = 2
df.iloc[df.index.get_level_values(1) >= min_date] /= ratio
print(df)
Which gives you correctly:
value
id date
330 2020-03-30 03:00:00 180.0
2020-03-30 04:00:00 360.0
331 2020-03-30 05:00:00 60.0
2020-03-30 06:00:00 300.0
Also note that you can set_index without creating a copy of your DataFrame by passing the keyword argument inplace=True, which is, of course, better for memory management, especially given the size of your DataFrame.
EDIT: If ratio and min_date have to be evaluated for each id, then I don't think you can avoid the for loop. The right way to iterate through the levels of a MultiIndex is with groupby, as follows:
for id, df_id in df.groupby(level=0):
    min_date = datetime.datetime(2020, 3, 30, 5)
    ratio = 2
    # Build the mask over the full index: assigning through chained indexing
    # like df.loc[id].iloc[condition] would modify a copy, not df itself.
    condition = ((df.index.get_level_values(0) == id)
                 & (df.index.get_level_values(1) >= min_date))
    df.loc[condition] /= ratio
which gives the same result as above with the difference that you now have ratio and min_date in the for loop.
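If calc_ratio and calc_min_date (the helpers from the question) are cheap to call per id, a loop-free sketch under that assumption maps them over the id level and applies one vectorized mask:
import numpy as np

ids = df.index.get_level_values('id')
ratios = np.asarray(ids.map(calc_ratio))            # one ratio per row
min_dates = pd.to_datetime(ids.map(calc_min_date))  # one min_date per row
mask = df.index.get_level_values('date') >= min_dates
df.loc[mask, 'value'] /= ratios[mask]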
I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals; each date falls on the Monday of a calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe with an entry for each week I want and a zero quantity, merging the two dataframes, and dropping the dummy column. This results in a third, padded dataframe.
I don't feel this is a great solution to the problem, and being new to pandas I wanted to know if there is a more idiomatic way to achieve this, ideally without having to create a dummy dataframe and then merge.
The code for my current solution is below:
import numpy as np
import pandas as pd

# Create the dummy product
# Week holds the week date of the order; set as index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.index[0][0]    # first Week in the entire dataset
last_date = group_by_product_name.index[-1][0]    # last Week in the dataset
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week': bdates, 'DummyQty': qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the index dtype on one of the dataframes ends up as object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
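A likely culprit (my hypothesis, not stated in the original post): the Week values start out as strings, so one index is object dtype while bdates is datetime64, and the concat cannot align them. Converting up front avoids it:
# Parse Week into datetime64 before grouping, so both indexes share a dtype.
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])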
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week': ['2018-05-28', '2018-06-04',
                             '2018-06-11', '2018-06-25'],
                    'Qty': [100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates between the date range
df_zero = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-06-04',
                                 '2018-06-11', '2018-06-18', '2018-06-25', '2018-07-02'],
                        'Dummy Qty': [0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested, here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11', '2018-06-18',
                                         '2018-06-25', '2018-07-02'],
                                'Product Name': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
                                'Qty': [100, 200, 300, 400, 500, 600, 700]})
OK, given your original data, you can achieve the expected result by using pivot, then resample to fill in any missing weeks, like the following:
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])  # resample needs a DatetimeIndex
results = df_all_products.groupby(
    ['Week', 'Product Name']
)['Qty'].sum().reset_index().pivot(
    index='Week', columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name A B
Week
2018-05-21 100.0 300.0
2018-05-28 200.0 0.0
2018-06-04 0.0 0.0
2018-06-11 400.0 0.0
2018-06-18 0.0 500.0
2018-06-25 600.0 0.0
2018-07-02 700.0 0.0
So if you want to get the df for Product Name A, you can do results['A'].
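For the single-product case from the first edit (df1 above), a reindex against a full weekly range is an even shorter sketch of the same padding; the date bounds here are taken from the example:
# Pad df1 with zero-quantity weeks across the full date range.
full_weeks = pd.date_range('2018-05-21', '2018-07-02', freq='W-MON')
df1.index = pd.to_datetime(df1.index)
df_padded = df1.reindex(full_weeks, fill_value=0)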
New to multiindexing in Pandas. I have data that looks like this
Date Time value
2014-01-14 12:00:04 .424
12:01:12 .342
12:01:19 .341
...
12:05:49 .23
2014-05-12 ...
1:02:42 .23
....
For now, I want to access the last time for every single date and store the value in some array. I've made a multiindex like this
df = pd.read_csv("df.csv", index_col=0)
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date', 'Time'])
df = df[~df.index.duplicated(keep='first')]
dates = df.index.get_level_values(0)
So I have the dates saved as an array. I want to iterate through the dates, but I either can't get the syntax right or am accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and direct access (df.loc[dates[i]] or something like that) doesn't work either. Also, the number of time values in each date varies. Is there any way to fix this?
This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):
import pandas as pd

df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index('Time', drop=False)
max_times = df.groupby(level=0)['Time'].max()
print(max_times)
yields
Date
2014-01-14 12:05:49
2014-05-12 01:02:42
Name: Time, dtype: object
If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_timedelta(df['Time'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index()
idx = df.groupby(['Date'])['Time'].idxmax()
print(df.loc[idx])
yields
Date Time value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
I don't see a good way to do this while keeping the MultiIndex.
It is easier to perform the groupby operation before setting the MultiIndex.
Moreover, it is probably preferable to keep the datetimes as a single value instead of splitting them into two parts. Note that given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the date without making a separate Date column:
df = pd.DataFrame({'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12', '2014-01-14 12:01:19', '2014-01-14 12:05:49', '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['DateTime'] = pd.to_datetime(df['DateTime'])
# df = pd.read_csv('df.csv', parse_dates=[0])
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
result = df.loc[idx]
print(result)
yields
DateTime value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
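An equivalent sketch on the same DataFrame: if you sort by DateTime first, taking the last row per calendar date sidesteps idxmax entirely:
# Last row per date; sorting guarantees the last row holds the max time.
result = (df.sort_values('DateTime')
            .groupby(df['DateTime'].dt.date)
            .tail(1))
print(result)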