The following code generates two DataFrames:
frame1=pd.DataFrame({'dates':['2023-01-01','2023-01-07','2023-01-09'],'values':[0,18,28]})
frame1['dates']=pd.to_datetime(frame1['dates'])
frame1=frame1.set_index('dates')
frame2=pd.DataFrame({'dates':['2023-01-08','2023-01-12'],'values':[8,12]})
frame2['dates']=pd.to_datetime(frame2['dates'])
frame2=frame2.set_index('dates')
Using
frame1.asfreq('D').interpolate()
frame2.asfreq('D').interpolate()
we can interpolate their values between the days for each frame separately, obtaining daily values (0, 3, 6, …, 18, 23, 28 for frame1 and 8, 9, 10, 11, 12 for frame2).
However, consider now the concatenation table:
frame1['frame']='f1'
frame2['frame']='f2'
concat=pd.concat([frame1,frame2])
concat=concat.set_index('frame',append=True)
concat=concat.reorder_levels(['frame','dates'])
concat
I want to do the interpolation using one command like
concat.groupby('frame').apply(lambda g:g.asfreq('D').interpolate())
directly on the concatenated table. Unfortunately, the command above does not work; it raises a TypeError:
TypeError: Cannot convert input [('f1', Timestamp('2023-01-01 00:00:00'))] of type <class 'tuple'> to Timestamp
How do I fix that command to work?
You have to drop the first index level (the group key) before calling asfreq, so that each group looks like your initial DataFrames:
>>> concat.groupby('frame').apply(lambda g: g.loc[g.name].asfreq('D').interpolate())
values
frame dates
f1 2023-01-01 0.0
2023-01-02 3.0
2023-01-03 6.0
2023-01-04 9.0
2023-01-05 12.0
2023-01-06 15.0
2023-01-07 18.0
2023-01-08 23.0
2023-01-09 28.0
f2 2023-01-08 8.0
2023-01-09 9.0
2023-01-10 10.0
2023-01-11 11.0
2023-01-12 12.0
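The underlying issue is that asfreq needs a plain DatetimeIndex, while inside apply each group still carries the ('frame', 'dates') MultiIndex; g.loc[g.name] strips the outer level. An equivalent sketch that drops the level explicitly with droplevel:
out = concat.groupby('frame').apply(
    lambda g: g.droplevel('frame').asfreq('D').interpolate()
)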
To debug, use a named function instead of a lambda:
def interpolate(g):
    print(f'[Group {g.name}]')
    print(g.loc[g.name])
    print()
    return g.loc[g.name].asfreq('D').interpolate()
out = concat.groupby('frame').apply(interpolate)
Output:
[Group f1]
values
dates
2023-01-01 0
2023-01-07 18
2023-01-09 28
[Group f2]
values
dates
2023-01-08 8
2023-01-12 12
I have a pd.Series object with a pd.DatetimeIndex containing dates. I would like to calculate the difference from a past value, for example the one month before. The values are not exactly aligned to the months, so I cannot simply add a monthly date offset. There might also be missing data.
So I would like to match the previous value using an offset and a tolerance. One way to do this is using the .reindex() method with method='nearest' which matches the previous data point almost like I want to:
from datetime import timedelta

shifted = data.copy()
shifted.index = shifted.index + pd.DateOffset(months=1)
shifted = shifted.reindex(
    data.index,
    method="nearest",
    tolerance=timedelta(days=100),
)
result = data - shifted
Here we calculate the difference from the value one month before, but we tolerate matching a value up to 100 days away from that timestamp.
This is almost what I want, but I want to avoid subtracting the value from itself. I always want to subtract a value in the past, or no value at all.
For example: if this is the data
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
and I use the code above, then the last data point (3.0) will be subtracted from itself, since its date is closer to 2020-05-05 than to 2020-03-03. The result will be
2020-01-02 0.0
2020-02-03 1.0
2020-04-05 0.0
While the goal is to get
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
Additional edit after Baron Legendre's answer (thanks for pointing out the flaw in my question):
The tolerance variable is also important to me. So let's say there is a gap of a year in the data, that falls outside the tolerance of 100 days, and the result should be NaN:
2015-12-04 10.0
2020-01-02 1.0
2020-02-03 2.0
2020-04-05 3.0
Should result in:
2015-12-04 NaN (because there is no past value to subtract)
2020-01-02 NaN (because the past value is too far back)
2020-02-03 1.0
2020-04-05 1.0
Hope that explains the problem well enough. Any ideas on how to do this efficiently, without looping over every single data point?
Compute the day gap between consecutive timestamps and mask the plain difference wherever that gap exceeds the 100-day tolerance:
ser
###
2015-12-04 10
2020-01-02 1
2020-02-03 2
2020-04-05 3
dtype: int64
import numpy as np

df = ser.reset_index()
tdiff = df['index'].diff().dt.days   # day gap to the previous timestamp
ser[:] = np.where(tdiff > 100, np.nan, ser - ser.shift())
ser
###
2015-12-04 NaN
2020-01-02 NaN
2020-02-03 1.0
2020-04-05 1.0
dtype: float64
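For reference, a minimal self-contained sketch of the same idea, with the series rebuilt from the question's example data so it can be run directly:
import pandas as pd

ser = pd.Series(
    [10, 1, 2, 3],
    index=pd.to_datetime(['2015-12-04', '2020-01-02', '2020-02-03', '2020-04-05']),
)

# Day gap to the previous observation (NaN for the first row)
gap = ser.index.to_series().diff().dt.days

# Difference from the previous value, masked where the gap exceeds the
# 100-day tolerance; the first row has no past value and stays NaN
result = (ser - ser.shift()).where(gap <= 100)
print(result)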
I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and run df = pd.read_clipboard() to reproduce), with the dtypes as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with the 'date' column via df['datetime'] = df['date'] + ' ' + df['time'] (with the purpose of then converting it with pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I instead convert 'hour' and 'minute' to str in order to concatenate the three columns into 'datetime', then I run into the NaN values, which prevent me from converting 'datetime' to the corresponding type.
I have also tried first converting the 'date' column with df['date'] = df['date'].astype('datetime64[ns]'), again creating the 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time, and then combining the two with df['datetime'] = pd.datetime.combine(df['date'], df['time']), which returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assuming they can be treated as 00:00:00)?
And what if I have a row with all NaN values? Would it be possible to ignore all NaNs and have 'datetime' be NaN for that row?
Thank you in advance, ^_^
First convert date to datetimes, then add the hour and minute timedeltas, replacing missing values with a 0 timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
                  pd.to_timedelta(df['hour'], unit='h').fillna(td) +
                  pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
                  .add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
                  .add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
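Regarding the follow-up about a row where everything is missing: pd.to_datetime turns a missing date into NaT, and NaT plus any timedelta stays NaT, so the combined column ends up as NaT for such rows. A small sketch with a hypothetical all-NaN row:
df2 = pd.DataFrame({'date':   ['2021-01-01', None],
                    'hour':   [7.0, None],
                    'minute': [15.0, None]})

td = pd.Timedelta(0)
df2['datetime'] = (pd.to_datetime(df2['date']) +
                   pd.to_timedelta(df2['hour'], unit='h').fillna(td) +
                   pd.to_timedelta(df2['minute'], unit='m').fillna(td))
print(df2['datetime'])
# 0   2021-01-01 07:15:00
# 1                   NaT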
I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it reads a bit messy:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
date hour minute result
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
The question has a base on the following SO:
Groupy brings only one key from Pandas dictionary
Dataframe looks like:
ALUP11 Return % Day CESP6 Return % Day TAEE11 Return % Day
Data
2020-08-13 23.81 0.548986 13.0 29.38 -2.747435 13.0 28.33 -0.770578 13.0
2020-09-01 23.68 1.067008 1.0 30.21 0.365449 1.0 28.55 1.205246 1.0
2020-08-31 23.43 -1.139241 31.0 30.10 -2.336145 31.0 28.21 -0.669014 31.0
2020-08-28 23.70 1.455479 28.0 30.82 1.615562 28.0 28.40 0.459851 28.0
2020-08-27 23.36 -0.680272 27.0 30.33 -1.717434 27.0 28.27 0.354988 27.0
After building the DataFrame from the dictionary, I need the sum over the same days, but
result = df.groupby('Day').agg({'Return %': ['sum']})
result
I get the error:
ValueError: Grouper for 'Day' not 1-dimensional
For each symbol I would like to sum the same days of the month. In the example I have 3 symbols, so the result should contain the per-day sums for each of them.
If your data looks like the data in the answer to your previous question, the error is because you have two columns named Day. As they appear to hold the same data, you can drop the last one, and then your groupby will work:
df = df.iloc[:, :-1].groupby('Day')
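To see why the error occurs in the first place: with duplicated column labels, df['Day'] returns a DataFrame rather than a Series, which is not a valid (1-dimensional) grouper. A tiny illustration with a made-up column layout:
demo = pd.DataFrame([[0.55, 13.0, -2.75, 13.0]],
                    columns=['Return %', 'Day', 'Return %', 'Day'])
print(type(demo['Day']))      # DataFrame, not Series -> "not 1-dimensional"
trimmed = demo.iloc[:, :-1]   # keep only one 'Day' column
print(type(trimmed['Day']))   # Series -> a valid grouper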
I am working with UPC (product#), date_expected, and quantity_picked columns and need my data organized to show the total quantity_picked per day (for every day) for each UPC. Example data below:
UPC quantity_picked date_expected
0 0001111041660 1.0 2019-05-14 15:00:00
1 0001111045045 1.0 2019-05-14 15:00:00
2 0001111050268 1.0 2019-05-14 15:00:00
3 0001111086132 1.0 2019-05-14 15:00:00
4 0001111086983 1.0 2019-05-14 15:00:00
5 0001111086984 1.0 2019-05-14 15:00:00
... ... ...
39694 0004470036000 6.0 2019-06-24 20:00:00
39695 0007225001116 1.0 2019-06-24 20:00:00
I was able to successfully organize my data in this manner using the code below, but the output leaves out dates with quantity_picked=0
orders = pd.read_sql_query(SQL, con=sql_conn)
order_daily = orders.copy()
order_daily['date_expected'] = order_daily['date_expected'].dt.normalize()
order_daily['date_expected'] = pd.to_datetime(order_daily.date_expected, format='%Y-%m-%d')
# Groups by date and UPC getting the sum of quanitity picked for each
# then resets index to fill in dates for all rows
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
gives the following output:
UPC date_expected quantity_picked
0 0000000002554 2019-05-21 4.0
1 0000000002554 2019-05-24 2.0
2 0000000002554 2019-06-02 2.0
3 0000000002554 2019-06-17 2.0
4 0000000003082 2019-05-15 2.0
5 0000000003082 2019-05-16 2.0
6 0000000003082 2019-05-17 8.0
... ... ...
31588 0360600051715 2019-06-17 1.0
31589 0501072452748 2019-06-15 1.0
31590 0880100551750 2019-06-07 2.0
When I try to follow the solution given in:
Pandas filling missing dates and values within group
I adjust my code to
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reindex(idx, fill_value=0).reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
# Viewing first 10 rows to check format of dataframe
print('Preview of Total per Item per Day')
print(tipd.iloc[0:10])
And receive the following error:
TypeError: Argument 'tuples' has incorrect type (expected numpy.ndarray, got DatetimeArray)
I need each date to be listed for each product, even when quantity picked is zero. I plan on creating two new columns using .shift and .diff for calculations, and those columns will not be accurate if my data is skipping dates.
Any guidance is very much appreciated.
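The TypeError most likely comes from reindexing with a plain DatetimeIndex (idx) while the grouped result carries a (UPC, date_expected) MultiIndex. A hedged sketch of the linked approach, building the full index as the product of all UPCs and every day in the range (sample data invented here; column names taken from the question):
import pandas as pd

order_daily = pd.DataFrame({
    'UPC': ['0000000002554', '0000000002554', '0000000003082'],
    'date_expected': pd.to_datetime(['2019-05-21', '2019-05-24', '2019-05-15']),
    'quantity_picked': [4.0, 2.0, 2.0],
})

summed = order_daily.groupby(['UPC', 'date_expected'])['quantity_picked'].sum()

# Full index: every UPC paired with every calendar day in the observed range
full_idx = pd.MultiIndex.from_product(
    [order_daily['UPC'].unique(),
     pd.date_range(order_daily['date_expected'].min(),
                   order_daily['date_expected'].max(), freq='D')],
    names=['UPC', 'date_expected'],
)

tipd = summed.reindex(full_idx, fill_value=0).reset_index()
tipd = tipd[['UPC', 'date_expected', 'quantity_picked']]
print(tipd.head(10))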
I have a Python Dataframe that looks like this:
Facility PUE PUEraw Servers
2016-11-14 00:00:00 6.0 NaN 1.2 5.0
2016-11-14 00:30:00 6.0 NaN 1.2 5.0
2016-11-14 01:00:00 6.0 NaN 1.2 5.0
etc.
As you can see, the index is date/time. The dataframe is updated with a new value every half hour.
I'm trying to write a script that removes all rows except those that correspond to TODAY's date, for which I am utilising date = dt.datetime.today(). However, I am struggling, partly perhaps because the index also contains the time.
Does anyone have any suggestions? Alternatively, a script that removes all but the last 48 rows would also work for me (the last 48 x half hourly values = the latest day's data).
Here are two options you can use to extract data on a specific day:
df.loc['2016-11-16']  # partial-string indexing on the DatetimeIndex
# Facility PUE PUEraw Servers
# 2016-11-16 01:00:00 6.0 NaN 1.2 5.0
import datetime
df[df.index.date == datetime.datetime.today().date()]
# Facility PUE PUEraw Servers
# 2016-11-16 01:00:00 6.0 NaN 1.2 5.0
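If the goal is to drop everything except today's rows in place, a sketch of the same idea using a normalized index comparison (assuming the half-hourly DatetimeIndex shown above):
df = df[df.index.normalize() == pd.Timestamp.today().normalize()]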
You can always access the last rows in a DataFrame with df.tail()
df = df.tail(48)
For further information:
Pandas Documentation