I'm new to multi-indexing in pandas. I have data that looks like this:
Date Time value
2014-01-14 12:00:04 .424
12:01:12 .342
12:01:19 .341
...
12:05:49 .23
2014-05-12 ...
1:02:42 .23
....
For now, I want to access the last time for every single date and store the corresponding value in some array. I've made a MultiIndex like this:
df = pd.read_csv("df.csv", index_col=0)
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date', 'Time'])
df = df[~df.index.duplicated(keep='first')]
dates = df.index.get_level_values(0)
So I have the dates saved as an array. I want to iterate through the dates, but I either can't get the syntax right or am accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and I can't do direct access either (df.loc[dates[i]] or something like that). Also, the number of time entries varies from date to date. Is there any way to fix this?
This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):
import pandas as pd
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index('Time', drop=False)
max_times = df.groupby(level=0)['Time'].max()
print(max_times)
yields
Date
2014-01-14 12:05:49
2014-05-12 01:02:42
Name: Time, dtype: object
If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_timedelta(df['Time'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index()
idx = df.groupby(['Date'])['Time'].idxmax()
print(df.loc[idx])
yields
Date Time value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
I don't see a good way to do this while keeping the MultiIndex.
It is easier to perform the groupby operation before setting the MultiIndex.
Moreover, it is probably preferable to keep each datetime as a single value instead of splitting it into two parts. Note that, given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the date without making a separate Date column:
df = pd.DataFrame({'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12', '2014-01-14 12:01:19',
                                '2014-01-14 12:05:49', '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['DateTime'] = pd.to_datetime(df['DateTime'])
# df = pd.read_csv('df.csv', parse_dates=[0])
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
result = df.loc[idx]
print(result)
yields
DateTime value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
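If you also want the Date/Time MultiIndex from your question on this result, one way to rebuild it afterwards (a sketch, not part of the answer above) is:
# rebuild the Date/Time MultiIndex from the DateTime column of the result above
result = result.set_index([result['DateTime'].dt.date.rename('Date'),
                           result['DateTime'].dt.time.rename('Time')]).drop(columns='DateTime')
print(result)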
I have a dataframe in Python 3 using pandas which has a column containing dates as strings.
This is a subset of the column:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove the rows that have the same month and day more than once and keep only the one with the newest year.
This is what I would expect as the result for this subset:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed: 2020-04-08 because 04-08 already appears in 2021, and the second 2020-04-12 because it is a duplicate.
I thought of doing this with apply and a lambda, but my real dataframe has hundreds of rows and tens of columns, so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One of them would be to extract the year, sort by year, and then drop rows with duplicate month-day pairs, keeping the newest year.
# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)
This is what it would look like after separation and sorting:
ColA year mo-day
2 2020-04-12 2020 04-12
3 2020-04-08 2020 04-08
4 2020-04-12 2020 04-12
0 2021-04-03 2021 04-03
1 2021-04-08 2021 04-08
Afterwards, we can simply drop the duplicate month-day pairs (keeping the last, i.e. newest-year, occurrence, since the frame is sorted by year in ascending order) and remove the helper columns:
# drop duplicate month-day pairs, keeping the newest year
df.drop_duplicates('mo-day', keep='last', inplace=True)
# get rid of the two helper columns
df.drop(['year', 'mo-day'], axis=1, inplace=True)
# since we dropped duplicates, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
ColA
0 2020-04-12
1 2021-04-03
2 2021-04-08
This should also be faster than converting the entire column to datetime and extracting the date parts, since it works on the strings directly.
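If performance matters, the same string split can also be done with pandas' vectorized .str accessor instead of apply; a minimal sketch, assuming the same ColA column as above:
# vectorized equivalent of the two apply calls above
df['year'] = df['ColA'].str[:4]
df['mo-day'] = df['ColA'].str[5:]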
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA': ["2021-04-03",
                            "2021-04-08",
                            "2020-04-12",
                            "2020-04-08",
                            "2020-04-12"]})
df['colA'] = df.colA.apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
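If you would rather avoid the Python-level apply for the grouping key too, the .dt accessor can produce the same keys in vectorized form; a sketch, assuming colA has already been converted to datetime as above:
# group by (month, day) using vectorized datetime accessors and keep the newest date
df.groupby([df.colA.dt.month, df.colA.dt.day]).max()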
I have a dataframe with an id column, and a date column made up of an integer.
d = {'id': [1, 2], 'date': [20161031, 20170930]}
df = pd.DataFrame(data=d)
id date
0 1 20161031
1 2 20170930
I can convert the date column to an actual date like so.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
id date
0 1 2016-10-31
1 2 2017-09-30
But I need to have this field as a timestamp with hours, minutes, and seconds so that it is compatible with my database table. I don't care what the values are; we can keep it easy by setting them to zeros.
2016-10-31 00:00:00
2017-09-30 00:00:00
What is the best way to change this field to a timestamp? I tried
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d%H%M%S'))
but pandas didn't like that.
I think I could append six 0's to the end of every value in that field and then use the above statement, but I was wondering if there is a better way.
With pandas it is simpler and faster to convert the entire column at once: first convert to string, then to a timestamp.
pd.to_datetime(df['date'].apply(str))
PS: there are a few other conversion methods of varying performance: https://datatofish.com/fastest-way-to-convert-integers-to-strings-in-pandas-dataframe/
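For example, astype is one of the vectorized conversions the linked post compares; a sketch with an explicit format, which also avoids any parsing ambiguity:
# convert the whole column at once: int -> str -> datetime
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')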
The problem seems to be that pd.to_datetime doesn't accept dates in this integer format:
pd.to_datetime(20161031) gives Timestamp('1970-01-01 00:00:00.020161031')
It assumes the integers are nanoseconds since 1970-01-01.
You have to convert to a string first:
df['date'] = pd.to_datetime(df["date"].astype(str))
Output:
id date
0 1 2016-10-31
1 2 2017-09-30
Note that these are datetimes, so they include a time component (all zeros in this case) even though it is not shown in the data frame representation above.
print(df.loc[0,'date'])
Out:
Timestamp('2016-10-31 00:00:00')
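If your database layer needs the time component written out explicitly as a string (an assumption about your setup), you could format it yourself; date_str here is just a hypothetical column name:
# hypothetical formatting step for a driver that expects a full timestamp string
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')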
If the column has already been converted to a datetime (as in your apply above), you can also round-trip through a string that spells out the time component:
df['date'] = pd.to_datetime(df['date'].dt.strftime('%Y%m%d%H%M%S'), format='%Y%m%d%H%M%S')
I would like to compute the mean per ID using groupby and mean. However, I only need the rows where Date is between year 2016-01-01 and 2017-12-31.
d = {'ID': ['STCK123', 'STCK123', 'STCK123'], 'Amount': [250, 400, 350],
'Date': ['2016-01-20', '2017-09-25', '2018-05-15']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Amount', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
This gives following df:
ID Amount Date
STCK123 250 2016-01-20
STCK123 400 2017-09-25
STCK123 350 2018-05-15
When I use:
data.groupby(['ID'])['Amount'].agg('mean')
It takes all rows into account, resulting in a mean value of 333.3. How can I exclude the rows where Date is 2018 (yielding a mean value of (250+400)/2=325)?
You'll need a pre-filtering step with query:
df.query('Date.dt.year != 2018').groupby('ID').mean()
Amount
ID
STCK123 325
More uses for eval, query, and associated parameters can be found here in my writeup: Dynamic Expression Evaluation in pandas using pd.eval()
See here for more methods on dropping rows before calling groupby.
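For comparison, the same pre-filtering can be written as a plain boolean mask on the data frame from the question; a sketch using the data variable defined there:
# equivalent boolean-mask prefilter before the groupby
data[data['Date'].dt.year != 2018].groupby('ID')['Amount'].mean()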
You can also mask those rows, without having to drop them. NaNs are excluded from the GroupBy aggregation.
df.mask(df.Date.dt.year == 2018).groupby('ID').mean()
Amount
ID
STCK123 325.0
I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals, each date occurs on the Monday of each calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe that includes an entry with zero quantity for each week I want, merging the two dataframes, and dropping the dummy column. This results in a third, padded dataframe.
I don't feel this is a great solution to the problem, and being new to pandas I wanted to know if there is a more specific and/or Pythonic way to achieve this, ideally without having to create a dummy dataframe and then merge.
The code I used is below to get my current solution:
# Create the dummy product
# Week hold the week date of the order, want to set this as index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.index[0][0]   # first week in the entire dataset
last_date = group_by_product_name.index[-1][0]   # last week in the dataset
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week':bdates, 'DummyQty':qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], axis=1, inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the index of one of the dataframes somehow loses its dtype and becomes object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week':['2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-25'], 'Qty':[100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates within the date range:
df_zero = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-18', '2018-06-25', '2018-07-02'], 'Dummy Qty':[0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11', '2018-06-18',
'2018-06-25', '2018-07-02'],
'Product Name':['A', 'A', 'B', 'A', 'B', 'A', 'A'],
'Qty':[100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected result by using pivot and resample to fill in any missing weeks, like the following:
results = df_all_products.groupby(
['Week','Product Name']
)['Qty'].sum().reset_index().pivot(
index='Week',columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name A B
Week
2018-05-21 100.0 300.0
2018-05-28 200.0 0.0
2018-06-04 0.0 0.0
2018-06-11 400.0 0.0
2018-06-18 0.0 500.0
2018-06-25 600.0 0.0
2018-07-02 700.0 0.0
So if you want to get the df for Product Name A, you can do results['A'].
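And if you then need the padded data back in long form, a usage sketch assuming the results frame built above:
qty_a = results['A']                                # weekly Series for product 'A', zeros filled in
tidy = results.stack().rename('Qty').reset_index()  # back to Week / Product Name / Qty rows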
Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps, e.g. 2018-10-08 05:23:07
series = pandas.Series(data, index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that would lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to reindex to the union of the existing and desired indices, interpolate over that, and then reindex to just the desired index.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
.assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
+ pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
Value=np.random.randn(337)
)
.set_index(['SampleTime'])
)
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
.reindex(df.index.union(desired_index))
.interpolate(method='time')
.reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (padding, for example).
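For example, one simple option (a sketch, just one of those ways) is to backfill that leading value after interpolating, since it falls before the first original sample:
result = (df
          .reindex(df.index.union(desired_index))
          .interpolate(method='time')
          .reindex(desired_index)
          .bfill())  # fill the leading NaN from the next available value
print(result)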
My methods:
Method 1:
# reindex the prices onto the desired index, copy over the known values,
# then linearly interpolate the remaining gaps
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.interpolate(method='linear')
Method 2:
# drop duplicate index entries, keeping the last value for each timestamp
prices = data.loc[~data.index.duplicated(keep='last')]
# prices = data.reset_index()
# make sure the existing index is datetime, then union it with the target index
prices.index = pd.to_datetime(prices.index, errors='coerce')
merged = prices.index.union(idx2)  # idx2 is the desired (target) index
s = prices.reindex(merged)
# interpolate over the combined index, then drop anything still missing
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df