How to set data to a view of multi-indexed DataFrame? - python

I'm a newbie to pandas.
I have a DataFrame created from data grabbed from my database, consisting of three columns: id, date, value (only one value per pair of id and date).
What I want to do is divide the value column by a specific number (ratio) for each id in a specific date range. As my data is large (>10M records), I thought setting a MultiIndex on my DataFrame would be a good idea. And finally, here's what I've done:
df = pd.DataFrame(raw_history, columns=['id', 'date', 'value'])
df = df.set_index(['id', 'date'])
for id in ids:
    ratio = calc_ratio(id)
    min_date = calc_min_date(id)
    history = df.loc[id]
    history.loc[history.index >= pd.to_datetime(min_date)] /= ratio
    df.loc[id] = history
What's the problem? It seems that I've misunderstood the concept of MultiIndex, and df.loc[id] gets cleared by the last line; I mean after that assignment, df.loc[id] returns an empty DataFrame.
So, what approach should I employ to get my column divided by ratio? I'm not sure whether using a MultiIndex is a good idea for my data, but performance is important.

If I understood correctly how your dataframe looks, then yes, a MultiIndex is a good idea. However, you don't need a for loop, which is usually a good thing in Python.
Your DataFrame should look something like this:
id date value
0 330 2020-03-30 03:00:00 180
1 330 2020-03-30 04:00:00 360
2 331 2020-03-30 05:00:00 120
3 331 2020-03-30 06:00:00 600
So this is what you can do:
import pandas as pd
import datetime

# Generate a sample DataFrame
ids = [330, 330, 331, 331]
df = pd.DataFrame({'id': ids,
                   'date': [datetime.datetime(2020, 3, 30, h) for h in range(3, 7)],
                   'value': [180, 360, 120, 600]})

# Set the index in place
df.set_index(['id', 'date'], inplace=True)

# Divide values by ratio only where the condition "date >= min_date" is satisfied
min_date = datetime.datetime(2020, 3, 30, 5)
ratio = 2
df.iloc[df.index.get_level_values(1) >= min_date] /= ratio
print(df)
Which gives you correctly:
value
id date
330 2020-03-30 03:00:00 180.0
2020-03-30 04:00:00 360.0
331 2020-03-30 05:00:00 60.0
2020-03-30 06:00:00 300.0
Also note that you can call set_index without creating a copy of your DataFrame by passing the keyword argument inplace=True, which is, of course, better for memory management, especially given the size of your DataFrame.
EDIT: If ratio and min_date have to be evaluated for each id, then I don't think you can avoid the for loop. The right way to iterate through the levels of a MultiIndex is with the groupby method, as follows:
for id, df_id in df.groupby(level=0):
    min_date = datetime.datetime(2020, 3, 30, 5)
    ratio = 2
    condition = df_id.index.get_level_values(1) >= min_date
    # Index df directly with the selected (id, date) pairs to avoid chained indexing,
    # which may silently assign to a copy
    df.loc[df_id.index[condition]] /= ratio
which gives the same result as above, with the difference that ratio and min_date are now evaluated inside the for loop.
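If calc_ratio and calc_min_date can be evaluated for all ids up front, a loop-free variant is also possible. The following is only a sketch built on the question's own helpers (calc_ratio, calc_min_date, ids), assuming the date level already holds datetimes:
# Precompute the per-id ratio and min_date, then build one boolean mask over the whole frame
ratios = {i: calc_ratio(i) for i in ids}
min_dates = {i: pd.to_datetime(calc_min_date(i)) for i in ids}

id_level = df.index.get_level_values('id')
date_level = df.index.get_level_values('date')

# Per-row threshold and ratio, looked up from the per-id dicts
row_min_date = pd.to_datetime(id_level.map(min_dates))
row_ratio = id_level.map(ratios).to_numpy()

mask = date_level >= row_min_date
df.loc[mask, 'value'] /= row_ratio[mask]
This touches the frame once instead of once per id, which tends to matter at >10M rows.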

Related

Replacing a for loop with something more efficient when comparing dates to a list

Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a data frame df which, for present purposes, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops, as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge in the previous results. The fillna(method='ffill') fills gaps in the middle of the data, while the final fillna(0) covers any gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
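A variant on the same example (just a sketch, same data as above): instead of merging, you can reindex the cumulative counts directly onto date_list, which yields the same Date/Events frame:
counts = df.groupby('Event Date').size().cumsum()
event_counts = (counts.reindex(date_list)   # one row per requested date
                      .ffill()              # carry the last known count forward
                      .fillna(0)            # dates before the first event get 0
                      .astype(int)
                      .rename_axis('Date')
                      .reset_index(name='Events'))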
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
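Building on that comparison idea, the whole pre_df_list can be produced without apply. A sketch using the question's df and date_list (assuming Event Date holds actual dates; note the question's snippet compared with >, while the stated goal "events by a given date" suggests <=):
# One vectorized comparison and sum per cut-off date replaces the row-wise apply
pre_df_list = [[d, int((df['Event Date'] <= d).sum())] for d in date_list]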

Pandas GroupBy and Mean over Date Range

I would like to compute the mean per ID using groupby and mean. However, I only need the rows where Date is between 2016-01-01 and 2017-12-31.
import pandas as pd

d = {'ID': ['STCK123', 'STCK123', 'STCK123'], 'Amount': [250, 400, 350],
     'Date': ['2016-01-20', '2017-09-25', '2018-05-15']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Amount', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
This gives following df:
ID Amount Date
STCK123 250 2016-01-20
STCK123 400 2017-09-25
STCK123 350 2018-05-15
When I use:
data.groupby(['ID'])['Amount'].agg('mean')
It takes all rows into account, resulting in a mean value of 333.3. How can I exclude the rows where Date is in 2018 (yielding a mean value of (250+400)/2=325)?
You'll need a pre-filtering step with query:
df.query('Date.dt.year != 2018').groupby('ID').mean()
Amount
ID
STCK123 325
More uses for eval, query, and associated parameters can be found here in my writeup: Dynamic Expression Evaluation in pandas using pd.eval()
See here for more methods on dropping rows before calling groupby.
You can also mask those rows, without having to drop them. NaNs are excluded from the GroupBy aggregation.
df.mask(df.Date.dt.year == 2018).groupby('ID').mean()
Amount
ID
STCK123 325.0
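Since the question asks for the specific window 2016-01-01 through 2017-12-31 rather than "anything but 2018", a boolean mask with Series.between is another minimal sketch (using data as defined in the question):
in_range = data['Date'].between('2016-01-01', '2017-12-31')  # inclusive on both ends
data.loc[in_range].groupby('ID')['Amount'].mean()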

Extend a pandas dataframe to include 'missing' weeks

I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals, each date occurs on the Monday of each calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe, which includes an entry for each week that I want with a zero quantity and then merging these two dataframes and dropping the dummy dataframe column. This results in a 3rd padded dataframe.
I don't feel this is a great solution to the problem and, being new to pandas, I wanted to know if there is a more specific and/or pythonic way to achieve this, probably without having to create a dummy dataframe and then merge.
The code I used is below to get my current solution:
# Create the dummy product
# Week holds the week date of the order; want to set this as the index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.index[0][0]   # first week in the entire dataset
last_date = group_by_product_name.index[-1][0]   # last week in the dataset
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week':bdates, 'DummyQty':qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], axis=1, inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the dtype of the index on one of the dataframes somehow degrades to object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
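A likely cause of that object-dtype index is Week values that stay plain strings on one side; a sketch of coercing both indexes with pd.to_datetime before concatenating (using the names from the snippet above):
# Ensure both sides are indexed by real datetimes before aligning on the index
df_dummy_product.index = pd.to_datetime(df_dummy_product.index)
group_by_product_name.index = pd.to_datetime(group_by_product_name.index)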
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week': ['2018-05-28', '2018-06-04', '2018-06-11', '2018-06-25'],
                    'Qty': [100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates between the date range
df_zero = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-06-04', '2018-06-11',
                                 '2018-06-18', '2018-06-25', '2018-07-02'],
                        'Dummy Qty': [0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11',
                                         '2018-06-18', '2018-06-25', '2018-07-02'],
                                'Product Name': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
                                'Qty': [100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected result by using pivot, then resample to fill in any missing weeks, like the following:
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])  # resample needs a datetime index

results = df_all_products.groupby(
    ['Week', 'Product Name']
)['Qty'].sum().reset_index().pivot(
    index='Week', columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name A B
Week
2018-05-21 100.0 300.0
2018-05-28 200.0 0.0
2018-06-04 0.0 0.0
2018-06-11 400.0 0.0
2018-06-18 0.0 500.0
2018-06-25 600.0 0.0
2018-07-02 700.0 0.0
So if you want to get the df for Product Name A, you can do results['A'].
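For the single-series case in the question's EDIT, the padding can also be done with a plain reindex onto a full weekly range, without a dummy frame (a sketch assuming df1 as defined above):
df1.index = pd.to_datetime(df1.index)  # Week was stored as strings
full_weeks = pd.date_range('2018-05-21', '2018-07-02', freq='W-MON')
df1_padded = df1.reindex(full_weeks, fill_value=0)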

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2010-10-08",periods=10,freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following; concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.
Get an example DataFrame:
import numpy as np
import pandas as pd

np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337))
      .set_index(['SampleTime'])
      )
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (pad, for example).
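If that leading NaN matters, one sketch is to back-fill it after interpolating, which simply takes the next available value rather than extrapolating:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .bfill()   # pad the timestamp that precedes the first original sample
 .reindex(desired_index)
)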
my methods
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df.interpolate(method='linear')
method 2
prices = data.loc[~data.index.duplicated(keep='last')]
# prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)  # idx2: the desired/target index (not defined in this snippet)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df

Find maximum of column for each business quarter pandas

Assume that I have the following data set
import pandas as pd, numpy, datetime
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 12, 31)
date_list = pd.date_range(start, end, freq='B')
numdays = len(date_list)
value = numpy.random.normal(loc=1e3, scale=50, size=numdays)
ids = numpy.repeat([1], numdays)
test_df = pd.DataFrame({'Id': ids,
                        'Date': date_list,
                        'Value': value})
I would now like to calculate the maximum within each business quarter for test_df. One possibility is to use resample with rule='BQ', how='max'. However, I'd like to keep the structure of the array and just generate another column with the maximum for each BQ; have you guys got any suggestions on how to do this?
I think the following should work for you; this groups on the quarter, calls transform on the 'Value' column, and returns the maximum value as a Series with its index aligned to the original df:
In [26]:
test_df['max'] = test_df.groupby(test_df['Date'].dt.quarter)['Value'].transform('max')
test_df
Out[26]:
Date Id Value max
0 2015-01-01 1 1005.498555 1100.197059
1 2015-01-02 1 1032.235987 1100.197059
2 2015-01-05 1 986.906171 1100.197059
3 2015-01-06 1 984.473338 1100.197059
........
256 2015-12-25 1 997.965285 1145.215837
257 2015-12-28 1 929.652812 1145.215837
258 2015-12-29 1 1086.128017 1145.215837
259 2015-12-30 1 921.663949 1145.215837
260 2015-12-31 1 938.189566 1145.215837
[261 rows x 4 columns]
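One caveat (not an issue for this single-year example): grouping on dt.quarter alone would pool Q1 of different years together if the data spanned multiple years. A year-aware sketch groups on a quarterly period instead:
# Group on a (year, quarter) period so quarters from different years stay separate
test_df['max'] = (test_df.groupby(test_df['Date'].dt.to_period('Q'))['Value']
                         .transform('max'))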
