Sum of same days of complex Pandas Data Frame - python

This question builds on the following SO question:
Groupy brings only one key from Pandas dictionary
The dataframe looks like:
ALUP11 Return % Day CESP6 Return % Day TAEE11 Return % Day
Data
2020-08-13 23.81 0.548986 13.0 29.38 -2.747435 13.0 28.33 -0.770578 13.0
2020-09-01 23.68 1.067008 1.0 30.21 0.365449 1.0 28.55 1.205246 1.0
2020-08-31 23.43 -1.139241 31.0 30.10 -2.336145 31.0 28.21 -0.669014 31.0
2020-08-28 23.70 1.455479 28.0 30.82 1.615562 28.0 28.40 0.459851 28.0
2020-08-27 23.36 -0.680272 27.0 30.33 -1.717434 27.0 28.27 0.354988 27.0
After building the dataframe from the dictionary, I need the sum over the same days, but
result = df.groupby('Day').agg({'Return %': ['sum']})
result
I get this error:
ValueError: Grouper for 'Day' not 1-dimensional
For each symbol I would like to sum the same days of the month. In the example I have 3 symbols, so the result should contain one summed 'Return %' per day for each of them.

If your data looks like the data in the answer to your previous question, the error is because you have more than one column named Day. As they appear to hold the same data, you could drop the last column and then your groupby will work:
df = df.iloc[:, :-1].groupby('Day')
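Note that the frame shown in the question has one Day column per symbol. Here is a minimal sketch (using the example's column names, and assuming all Day columns hold identical values) that keeps every symbol's returns:
# selecting a duplicated label returns a DataFrame, not a Series
returns = df['Return %']             # one column per symbol
day = df['Day'].iloc[:, 0]           # first 'Day' column; all are identical
result = returns.groupby(day).sum()  # per-day sum of each symbol's returns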

Related

Interpolation using `asfreq('D')` in Multiindex

The following code generates two DataFrames:
frame1=pd.DataFrame({'dates':['2023-01-01','2023-01-07','2023-01-09'],'values':[0,18,28]})
frame1['dates']=pd.to_datetime(frame1['dates'])
frame1=frame1.set_index('dates')
frame2=pd.DataFrame({'dates':['2023-01-08','2023-01-12'],'values':[8,12]})
frame2['dates']=pd.to_datetime(frame2['dates'])
frame2=frame2.set_index('dates')
Using
frame1.asfreq('D').interpolate()
frame2.asfreq('D').interpolate()
we can interpolate their values for every day in between.
However, now consider the concatenated table:
frame1['frame']='f1'
frame2['frame']='f2'
concat=pd.concat([frame1,frame2])
concat=concat.set_index('frame',append=True)
concat=concat.reorder_levels(['frame','dates'])
concat
I want to do the interpolation using one command like
concat.groupby('frame').apply(lambda g:g.asfreq('D').interpolate())
directly on the concatenated table. Unfortunately, my command above does not work but raises a TypeError:
TypeError: Cannot convert input [('f1', Timestamp('2023-01-01 00:00:00'))] of type <class 'tuple'> to Timestamp
How do I fix that command to work?
You have to drop the first index level (the group key) before using asfreq, so that each group looks like your initial dataframes:
>>> concat.groupby('frame').apply(lambda g: g.loc[g.name].asfreq('D').interpolate())
values
frame dates
f1 2023-01-01 0.0
2023-01-02 3.0
2023-01-03 6.0
2023-01-04 9.0
2023-01-05 12.0
2023-01-06 15.0
2023-01-07 18.0
2023-01-08 23.0
2023-01-09 28.0
f2 2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 11.0
2023-01-12 12.0
To debug, use a named function instead of a lambda:
def interpolate(g):
    print(f'[Group {g.name}]')
    print(g.loc[g.name])
    print()
    return g.loc[g.name].asfreq('D').interpolate()

out = concat.groupby('frame').apply(interpolate)
Output:
[Group f1]
values
dates
2023-01-01 0
2023-01-07 18
2023-01-09 28
[Group f2]
values
dates
2023-01-08 8
2023-01-12 12
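An equivalent sketch, assuming pandas 0.24+ (where DataFrame.droplevel is available), drops the group level by name instead of indexing with g.name:
# drop the 'frame' level so each group has a plain DatetimeIndex again
concat.groupby('frame').apply(lambda g: g.droplevel('frame').asfreq('D').interpolate())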

Pandas Aggregate Daily Data to Monthly Timeseries

I have a time series that looks like this (below)
I want to resample it monthly, so that 2019-10 is the average of all the PTS values for October, 2019-11 is the average for November, etc.
However, when I use the df.resample('M').mean() method, if the final day of a month does not have a value, it fills in a NaN in my dataframe. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0
Would this work?
df.resample('M').mean().dropna()
Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
"values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
"values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
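If whole months still come back empty, a minimal sketch combining this answer with the dropna suggestion from the comment above:
# month-end labels; drop months that had no observations at all
monthly = data.resample("M").mean().dropna()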

Fill in dates (missing after groupby by two columns) with quantity_picked=0 in dataframe

I am working with UPC (product#), date_expected, and quantity_picked columns and need my data organized to show the total quantity_picked per day (for every day) for each UPC. Example data below:
UPC quantity_picked date_expected
0 0001111041660 1.0 2019-05-14 15:00:00
1 0001111045045 1.0 2019-05-14 15:00:00
2 0001111050268 1.0 2019-05-14 15:00:00
3 0001111086132 1.0 2019-05-14 15:00:00
4 0001111086983 1.0 2019-05-14 15:00:00
5 0001111086984 1.0 2019-05-14 15:00:00
... ... ...
39694 0004470036000 6.0 2019-06-24 20:00:00
39695 0007225001116 1.0 2019-06-24 20:00:00
I was able to successfully organize my data in this manner using the code below, but the output leaves out dates with quantity_picked=0
orders = pd.read_sql_query(SQL, con=sql_conn)
order_daily = orders.copy()
order_daily['date_expected'] = order_daily['date_expected'].dt.normalize()
order_daily['date_expected'] = pd.to_datetime(order_daily.date_expected, format='%Y-%m-%d')
# Groups by date and UPC getting the sum of quantity picked for each
# then resets index to fill in dates for all rows
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
gives the following output:
UPC date_expected quantity_picked
0 0000000002554 2019-05-21 4.0
1 0000000002554 2019-05-24 2.0
2 0000000002554 2019-06-02 2.0
3 0000000002554 2019-06-17 2.0
4 0000000003082 2019-05-15 2.0
5 0000000003082 2019-05-16 2.0
6 0000000003082 2019-05-17 8.0
... ... ...
31588 0360600051715 2019-06-17 1.0
31589 0501072452748 2019-06-15 1.0
31590 0880100551750 2019-06-07 2.0
When I try to follow the solution given in:
Pandas filling missing dates and values within group
I adjust my code to
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reindex(idx, fill_value=0).reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
# Viewing first 10 rows to check format of dataframe
print('Preview of Total per Item per Day')
print(tipd.iloc[0:10])
And receive the following error:
TypeError: Argument 'tuples' has incorrect type (expected numpy.ndarray, got DatetimeArray)
I need each date to be listed for each product, even when quantity picked is zero. I plan on creating two new columns using .shift and .diff for calculations, and those columns will not be accurate if my data is skipping dates.
Any guidance is very much appreciated.
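The TypeError suggests idx is a plain DatetimeIndex, while the grouped result has a two-level (UPC, date) index. A sketch of one way to build a matching MultiIndex (the names here are illustrative, not taken from the linked answer):
# full calendar range covered by the data
full_dates = pd.date_range(order_daily['date_expected'].min(),
                           order_daily['date_expected'].max(), freq='D')
# every (UPC, date) pair, so missing days reindex to 0
idx = pd.MultiIndex.from_product([order_daily['UPC'].unique(), full_dates],
                                 names=['UPC', 'date_expected'])
tipd = (order_daily.groupby(['UPC', 'date_expected'])['quantity_picked']
        .sum()
        .reindex(idx, fill_value=0)
        .reset_index())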

Pandas - Sorting a dataframe by using datetimeindex

The following dataframe holds values from multiple Excel files. I wanted to do a time-series analysis, so I made the index a DatetimeIndex, but the index is not arranged according to the date:
Item Details Unit Op. Qty Price Op. Amt. Cl. Qty Price.1 Cl. Amt.
Month
2013-04-01 5 In 1 Pcs -56.0 172.78 -9675.58 -68.0 175.79 -11953.96
2013-04-01 Adaptor Pcs -17.0 9.00 -152.99 -17.0 9.00 -152.99
2013-04-01 Agro Tape Pcs -2.0 26.25 -52.50 -2.0 26.25 -52.50
...
2014-01-01 12" Angal Pcs -6.0 31.50 -189.00 -6.0 31.50 -189.00
2014-01-01 13 Mm Electrical Drill Check Set -1.0 247.50 -247.50 -1.0 247.50 -247.50
2014-01-01 14" Blad Pcs -5.0 157.49 -787.45 -5.0 157.49 -787.45
...
2013-09-01 Zinc Bolt 1/4 X 2"(box) Box -1.0 899.99 -899.99 -1.0 899.99 -899.99
2013-09-01 Zorik 88 32gram Pcs -1.0 45.00 -45.00 -1.0 45.00 -45.00
2013-09-01 Zorrik 311 Gram Pcs -1.0 270.01 -270.01 -1.0 270.01 -270.01
It is not sorted according to the date, and I want to sort the index and its respective rows as well. I googled and found there is a way to sort the DatetimeIndex, as follows:
all_data.index.sort_values()
DatetimeIndex(['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01',
...
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01'],
dtype='datetime64[ns]', name=u'Month', length=71232, freq=None)
But this sorts only the index. How can I sort the entire dataframe according to the sorted index? Kindly help.
I think you need sort_index:
all_data = all_data.sort_index()
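A tiny usage sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'Qty': [1, 2, 3]},
                  index=pd.to_datetime(['2014-01-01', '2013-04-01', '2013-09-01']))
df.index.name = 'Month'
df = df.sort_index()  # rows move together with their index labels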

How to fillna/missing values for an irregular timeseries for a Drug when Half-life is known

I have a dataframe (df) where column A holds drug units dosed at the time points given by the Timestamp index. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 mins). I am struggling with the pandas code. Would really appreciate help and insight. Thanks in advance
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I wanted to fill the NaN values as a function of the time elapsed and the half-life of the drug,
something like
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp               A
1991-04-21 09:09:00   9.0
1991-04-21 13:00:00   NaN
1991-04-21 19:00:00   NaN
1991-04-22 07:35:00  10.0
1991-04-22 13:40:00   NaN
1991-04-22 16:56:00   NaN"""
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
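As a sanity check on the first gap: 231 minutes elapse between 09:09 and 13:00, and exp(-t/lamda) with lamda = t_half/ln(2) is the same as halving every 180 minutes:
# half-life form of the same decay
9.0 * 0.5 ** (231 / 180)  # ≈ 3.6976, matching row 1 above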
