pandas: Conditionally Aggregate Consecutive Rows - python

I have a dataframe with a consecutive index (date for every calendar day) and a reference vector that does not contain every date (only working days).
I want to reindex the dataframe to only the dates in the reference vector, with the missing data aggregated into the latest entry before each gap (i.e. weekend data is added to the preceding Friday).
Currently I have implemented this by looping over the reversed index, collecting the weekend data and adding it to the next reference date encountered in the loop. Is there a more efficient "array-way" to do it?
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
                  index=pd.date_range(start="2018-01-01", periods=10))
print(df)

ref_dates = pd.date_range(start="2018-01-01", periods=10)
ref_dates = ref_dates[:5].append(ref_dates[7:])  # omit 2018-01-06 and -07

# inefficient approach by reverse-traversing the dates, collecting the data
# and aggregating it together with the first date that's in ref_dates
df.sort_index(ascending=False, inplace=True)
collector = []
for dt in df.index:
    if collector and dt in ref_dates:
        # data from previous iteration was collected -> aggregate it and reset collector
        # first append also the current data
        collector.append(df.loc[dt, :].values)
        collector = np.array(collector)
        # applying aggregation function, here sum as example
        aggregates = np.sum(collector, axis=0)
        # setting the new data
        df.loc[dt, :] = aggregates
        # reset collector
        collector = []
    if dt not in ref_dates:
        collector.append(df.loc[dt, :].values)

df = df.reindex(ref_dates)
print(df)
Gives the output (first: source dataframe, second: target dataframe)
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 4 16
2018-01-06 5 25
2018-01-07 6 36
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 15 77 # contains the sum of Jan 5th, 6th and 7th
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81

Still has a list comprehension loop, but works.
import pandas as pd
import numpy as np

# Create a dataframe which contains all days
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
                  index=pd.date_range(start="2018-01-01", periods=10))

# Create the reference dates, which only contain week-days (or whatever dates you need)
ref_dates = [x for x in df.index if x.weekday() < 5]

# Set the index of df to a forward-filled version of the reference dates
df.index = pd.Series([x if x in ref_dates else float('nan') for x in df.index]).fillna(method='ffill')

# Group by the unique dates and sum
df = df.groupby(level=0).sum()
print(df)
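If you want to avoid the Python-level loops entirely, the same idea can be expressed with pandas operations only. This is a minimal sketch (assuming ref_dates is a DatetimeIndex as in the question): every date is labelled with the last reference date at or before it, and the frame is then aggregated per label.

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
                  index=pd.date_range(start="2018-01-01", periods=10))
ref_dates = pd.date_range(start="2018-01-01", periods=10)
ref_dates = ref_dates[:5].append(ref_dates[7:])  # omit 2018-01-06 and -07

# keep reference dates, blank out the others and forward-fill, so every date
# is labelled with the last preceding reference date
keys = pd.Series(df.index, index=df.index).where(df.index.isin(ref_dates)).ffill()

# aggregate (here: sum) everything that shares the same label
result = df.groupby(keys).sum()
print(result)

The weekend rows are folded into 2018-01-05, reproducing the target dataframe shown above.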

Related

How to shift a column by 1 year in Python

With the python shift function, you are able to offset values by the number of rows. I'm looking to offset values by a specified time, which is 1 year in this case.
Here is my sample data frame. The value_py column is what I'm trying to return with a shift function. This is an oversimplified example of my problem. How do I specify a date as the offset parameter instead of a number of rows?
import pandas as pd
import numpy as np
test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])
test_df['value_py'] = [np.nan, np.nan, 10, 15]
I have tried this, but it shifts the index by 1 year rather than giving me a shifted value column:
test_df.set_index('dt')['value'].shift(12, freq='MS')
This should solve your problem:
test_df['new_val'] = test_df['dt'].map(test_df.set_index('dt')['value'].shift(12, freq='MS'))
test_df
dt value value_py new_val
0 2020-01-01 10 NaN NaN
1 2020-08-01 13 NaN NaN
2 2021-01-01 15 10.0 10.0
3 2022-01-01 14 15.0 15.0
Use .map() to map the values at the shifted dates back onto the original dates.
Also, you should use 12 as your shift parameter, not -12.
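Breaking the one-liner apart may make the mechanics clearer. Below is a sketch of the same approach, showing the intermediate shifted series before it is mapped back onto the original dates:

import pandas as pd
import numpy as np

test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])

# move every value 12 month-starts into the future on a date index
shifted = test_df.set_index('dt')['value'].shift(12, freq='MS')

# look up each original date in the shifted series; dates with no entry
# one year earlier come back as NaN
test_df['value_py'] = test_df['dt'].map(shifted)
print(test_df)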

Select previous row every hour in pandas

I am trying to obtain the closest previous data point every hour in a pandas data frame. For example:
time value
0 14:59:58 15
1 15:00:10 20
2 15:57:42 14
3 16:00:30 9
would return
time value
0 15:00:00 15
1 16:00:00 14
i.e. rows 0 and 2 of the original data frame. How would I go about doing so? Thanks!
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {"time": ["14:59:58", "15:00:10", "15:57:42", "16:00:30"], "value": [15, 20, 14, 9]}
)
Here is one way to do it:
# Setup
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")
temp_df = pd.DataFrame(df["time"].dt.round("H").drop_duplicates()).assign(value=pd.NA)

# Add round hours to df, find nearest data points and drop previous hours
new_df = (
    pd.concat([df, temp_df])
    .sort_values(by="time")
    .fillna(method="ffill")
    .pipe(lambda df_: df_[~df_["time"].isin(df["time"])])
    .reset_index(drop=True)
)

# Cleanup
new_df["time"] = new_df["time"].dt.time
print(new_df)
# Output
time value
0 15:00:00 15
1 16:00:00 14
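An alternative sketch with pd.merge_asof, which performs the "last value at or before each hour" lookup directly (the hour grid is built with the same rounding as temp_df above):

import pandas as pd

df = pd.DataFrame(
    {"time": ["14:59:58", "15:00:10", "15:57:42", "16:00:30"], "value": [15, 20, 14, 9]}
)
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")

# hour grid to evaluate (same rounding as temp_df above)
hours = pd.DataFrame({"time": df["time"].dt.round("H").drop_duplicates()})

# for each hour, take the last data point at or before it
result = pd.merge_asof(hours.sort_values("time"), df.sort_values("time"),
                       on="time", direction="backward")
result["time"] = result["time"].dt.time
print(result)

If a sample that falls exactly on the hour should not count as a "previous" point, pass allow_exact_matches=False to merge_asof.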

How can I replace the FOR loop by something faster

I'm trying to transform my dataframe based on certain conditions. Following is my input dataframe
In [11]: df
Out[11]:
DocumentNumber I_Date N_Date P_Date Amount
0 1234 2016-01-01 2017-01-01 2017-10-23 38.38
1 2345 2016-01-02 2017-01-02 2018-03-26 41.00
2 1324 2016-01-12 2017-01-03 2018-03-26 30.37
3 5421 2016-01-13 2017-01-02 2018-03-06 269.00
4 5532 2016-01-15 2017-01-04 2018-06-30 271.00
Desired solution:
Each row is a unique document, and my aim is to find the number of documents and their total amount that meet the conditions below, for each combination of cutoff date and delta.
I can get the desired result with a for-loop, but I know it is not ideal and it gets slower as my data grows. Since I am new to Python, I would appreciate help replacing the loop with a list comprehension or any other faster option.
Code:
d1 = datetime.date(2017, 1, 1)
d2 = datetime.date(2017, 1, 15)
mydates = pd.date_range(d1, d2).tolist()
Delta = pd.Series(range(0, 5)).tolist()

df_A = []
for i in mydates:
    for j in Delta:
        A = df[(df["I_Date"] < i) & (df["N_Date"] > i + j) & (df["P_Date"] > i)]
        A["DateCutoff"] = i
        A["Delta"] = j
        A = A.groupby(['DateCutoff', 'Delta'], as_index=False).agg({'Amount': 'sum', 'DocumentNumber': 'count'})
        A.columns = ['DateCutoff', 'Delta', 'A_PaymentAmount', 'A_DocumentNumber']
        df_A.append(A)
df_A = pd.concat(df_A, sort=False)
Output:
In [14]: df_A
Out[14]:
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
I don't see a way to remove the loop from your code, because the loop is creating individual dataframes based on the contents of mydates and Delta.
In this example you are creating 75 different dataframes.
On each dataframe you .groupby, then .agg the sum of payments and the count of document numbers.
Each dataframe is appended to a list.
pd.concat the complete list into a dataframe.
One significant improvement
Check the Boolean condition before creating the dataframe and performing the remaining operations. In this example, operations were performed on 69 empty dataframes. By checking the condition first, operations will only be performed on the 6 dataframes containing data.
condition.any() returns True as long as at least one element is True
Minor changes
datetime + int is deprecated, so change that to datetime + timedelta(days=x)
pd.Series(range(0,5)).tolist() is overkill for making a list. Now timedelta objects are needed, so use [timedelta(days=x) for x in range(5)]
Instead of iterating with two for-loops, use itertools.product on mydates and Delta. This creates a generator of tuples in the form (Timestamp('2017-01-01 00:00:00', freq='D'), datetime.timedelta(0))
Use .copy() when creating dataframe A, to prevent SettingWithCopyWarning
Note:
A list comprehension was mentioned in the question. They are just a pythonic way of making a for-loop, but don't necessarily improve performance.
All of the calculations are using pandas methods, not for-loops. The for-loop only creates the dataframe from the condition.
Updated Code:
from itertools import product
import pandas as pd
from datetime import date, timedelta

d1 = date(2017, 1, 1)
d2 = date(2017, 1, 15)
mydates = pd.date_range(d1, d2)
Delta = [timedelta(days=x) for x in range(5)]

df_list = list()
for t in product(mydates, Delta):
    condition = (df["I_Date"] < t[0]) & (df["N_Date"] > t[0] + t[1]) & (df["P_Date"] > t[0])
    if condition.any():
        A = df[condition].copy()
        A["DateCutoff"] = t[0]
        A["Delta"] = t[1]
        A = A.groupby(['DateCutoff', 'Delta'], as_index=False).agg({'Amount': 'sum', 'DocumentNumber': 'count'})
        A.columns = ['DateCutoff', 'Delta', 'A_PaymentAmount', 'A_DocumentNumber']
        df_list.append(A)
df_CutOff = pd.concat(df_list, sort=False)
Output
The same as the original
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
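For reference, the sample df used in both snippets can be rebuilt from the values shown in the question (a sketch; the three date columns are assumed to be datetime64 so the comparisons against the generated dates work):

import pandas as pd

df = pd.DataFrame({
    'DocumentNumber': [1234, 2345, 1324, 5421, 5532],
    'I_Date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-12',
                              '2016-01-13', '2016-01-15']),
    'N_Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03',
                              '2017-01-02', '2017-01-04']),
    'P_Date': pd.to_datetime(['2017-10-23', '2018-03-26', '2018-03-26',
                              '2018-03-06', '2018-06-30']),
    'Amount': [38.38, 41.00, 30.37, 269.00, 271.00],
})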

How to correctly perform all-VS-all row-by-row comparisons between series in two pandas dataframes?

I have two pandas dataframes, df1 and df2. Both contain time series data.
df1
Event Number Timestamp_A
A 1 7:00
A 2 8:00
A 3 9:00
df2
Event Number Timestamp_B
B 1 9:01
B 2 8:01
B 3 7:01
Basically, I want to determine the Event B which is closest to Event A, and assign this correctly.
Therefore, I need to subtract every Timestamp_B in df2 from every Timestamp_A in df1, row by row. This results in a series of values, of which I want to take the minimum and put it into a new column in df1.
Event Number Timestamp_A Closest_Timestamp_B
A 1 7:00 7:01
A 2 8:00 8:01
A 3 9:00 9:01
I am not familiar with row-by-row operations in pandas.
When I am doing:
for index, row in df1.iterrows():
    s = df1.Timestamp_A.values - df2["Timestamp_B"][:]
    Closest_Timestamp_B = s.min()
The result I get is a ValueError:
ValueError: operands could not be broadcast together with shapes(3,) (4,)
How to correctly perform row-by-row comparisons between two pandas dataframes?
There might be a better way to do this but here is one way:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': ['7:00', '8:00', '9:00']})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': ['7:01', '8:01', '9:01']})

df1['Closest_timestamp_B'] = np.zeros(len(df1.index))
for index, row in df1.iterrows():
    df1['Closest_timestamp_B'].iloc[index] = df2.Timestamp_B.loc[
        np.argmin(np.abs(pd.to_datetime(df2.Timestamp_B) - pd.to_datetime(row.Timestamp_A)))]
df1
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
Your best bet is to use the underlying numpy data structure to create a matrix of Timestamp_A by Timestamp_B. Since you need to compare every event in A to every event in B, this is an O(N^2) calculation, well suited for a matrix.
import pandas as pd
import numpy as np

df1 = pd.DataFrame([['A', 1, '7:00'],
                    ['A', 2, '8:00'],
                    ['A', 3, '9:00']], columns=['Event', 'Number', 'Timestamp_A'])
df2 = pd.DataFrame([['B', 1, '9:01'],
                    ['B', 2, '8:01'],
                    ['B', 3, '7:01']], columns=['Event', 'Number', 'Timestamp_B'])
df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)

# create a matrix with the index of df1 as the row index, and the index
# of df2 as the column index
M = df1.Timestamp_A.values.reshape((len(df1), 1)) - df2.Timestamp_B.values

# for each row (each event A), use argmin to find the column with the lowest
# value (after abs()), i.e. the closest event B
index_of_B = np.abs(M).argmin(axis=1)
df1['Closest_timestamp_B'] = df2.Timestamp_B.values[index_of_B]
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 2017-07-05 07:00:00 2017-07-05 07:01:00
1 A 2 2017-07-05 08:00:00 2017-07-05 08:01:00
2 A 3 2017-07-05 09:00:00 2017-07-05 09:01:00
If you want to return to the original formatting for the timestamps, you can use:
df1.Timestamp_A = df1.Timestamp_A.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1.Closest_timestamp_B = df1.Closest_timestamp_B.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
What about using merge_asof to get the closest events?
Make sure your data types are correct:
df1.Timestamp_A = df1.Timestamp_A.apply(pd.to_datetime)
df2.Timestamp_B = df2.Timestamp_B.apply(pd.to_datetime)
Sort by the times:
df1.sort_values('Timestamp_A', inplace=True)
df2.sort_values('Timestamp_B', inplace=True)
Now you can merge the two dataframes on the closest time:
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    suffixes=('_df2', '_df1'))

# clean up the datetime formats
df3[['Timestamp_A', 'Timestamp_B']] = df3[['Timestamp_A', 'Timestamp_B']].applymap(lambda ts: ts.time())

# put the df1 columns first
df3 = df3.iloc[:, ::-1]
print(df3)
Timestamp_A Number_df1 Event_df1 Timestamp_B Number_df2 Event_df2
0 07:00:00 1 A 07:01:00 3 B
1 08:00:00 2 A 08:01:00 2 B
2 09:00:00 3 A 09:01:00 1 B
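merge_asof can also be driven from df1's side with direction='nearest', which answers the question as stated (the Event B closest to each Event A) in a single call. A sketch, assuming the same raw data as in the question:

import pandas as pd

df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': pd.to_datetime(['7:00', '8:00', '9:00'])})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': pd.to_datetime(['9:01', '8:01', '7:01'])})

# both sides must be sorted on their respective keys
closest = pd.merge_asof(df1.sort_values('Timestamp_A'),
                        df2.sort_values('Timestamp_B'),
                        left_on='Timestamp_A', right_on='Timestamp_B',
                        direction='nearest', suffixes=('_df1', '_df2'))
print(closest[['Event_df1', 'Number_df1', 'Timestamp_A', 'Timestamp_B']])

This returns 7:01, 8:01 and 9:01 for the three A events, matching the desired output in the question.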
Use apply to compare Timestamp_A on each row with all Timestamp_B and get the index of the row with min diff, then extract Timestamp_B using the index.
df1['Closest_Timestamp_B'] = (
    df1.apply(lambda x: abs(pd.to_datetime(x.Timestamp_A).value -
                            df2.Timestamp_B.apply(lambda x: pd.to_datetime(x).value))
              .idxmin(), axis=1)
    .apply(lambda x: df2.Timestamp_B.loc[x])
)
df1
Out[271]:
Event Number Timestamp_A Closest_Timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01

Reading excel file in python with pandas and multiple indices

I am a python newbie so please excuse this basic question.
My .xlsx File looks like this
Unnamed:1 A Unnamed:2 B
2015-01-01 10 2015-01-01 10
2015-01-02 20 2015-01-01 20
2015-01-03 30 NaT NaN
When I read it in Python using pandas.read_excel(...), pandas automatically uses the first column as the time index.
Is there a one-liner that tells pandas to notice that every second column is a time index belonging to the time series right next to it?
The desired output would look like this:
date A B
2015-01-01 10 10
2015-01-02 20 20
2015-01-03 30 NaN
In order to parse chunks of adjacent columns and align on their respective datetime indexes, you can do the following:
Starting with df:
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
Unnamed: 0 3 non-null datetime64[ns]
A 3 non-null int64
Unnamed: 1 2 non-null datetime64[ns]
B 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
You could iterate over chunks of 2 columns and merge on index like so:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

merged = df.loc[:, list(df)[:2]].set_index(list(df)[0])
for cols in chunks(list(df)[2:], 2):
    merged = merged.merge(df.loc[:, cols].set_index(cols[0]).dropna(),
                          left_index=True, right_index=True, how='outer')
to get:
A B
2015-01-01 10 10
2015-01-01 10 20
2015-01-02 20 NaN
2015-01-03 30 NaN
pd.concat unfortunately doesn't work as it can't handle duplicate index entries, otherwise one could use a list comprehension:
pd.concat([df.loc[:, cols].set_index(cols[0]) for cols in chunks(list(df), 2)], axis=1)
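For completeness, roughly the same merge can be written with plain pandas straight after pd.read_excel, pairing up every date/value column and joining on the dates. This is a sketch assuming the alternating date/value layout shown above and a hypothetical file name data.xlsx:

import pandas as pd

raw = pd.read_excel("data.xlsx")  # hypothetical path

# split the sheet into (date column, value column) pairs
pieces = []
for date_col, value_col in zip(raw.columns[::2], raw.columns[1::2]):
    piece = raw[[date_col, value_col]].dropna().set_index(date_col)
    piece.index.name = 'date'
    pieces.append(piece)

# outer-merge all pieces on their date indexes
merged = pieces[0]
for piece in pieces[1:]:
    merged = merged.merge(piece, left_index=True, right_index=True, how='outer')
print(merged)

As in the answer above, a duplicated date inside one column pair will still show up as duplicated rows in the result.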
I use xlrd to import the data, then pandas to display it:
import xlrd
import pandas as pd

workbook = xlrd.open_workbook(xls_name)
# if the file uses a different encoding, you can override it:
# workbook = xlrd.open_workbook(xls_name, encoding_override="cp1252")
worksheet = workbook.sheet_by_index(0)

first_row = []  # the row where we store the names of the columns
for col in range(worksheet.ncols):
    first_row.append(worksheet.cell_value(0, col))

data = []
for row in range(1, worksheet.nrows):  # data rows start after the header row
    elm = {}
    for col in range(worksheet.ncols):
        elm[first_row[col]] = worksheet.cell_value(row, col)
    data.append(elm)

# use separate lists (a single `a = b = []` would make them share one list)
first_column = []
second_column = []
third_column = []
for elm in data:
    first_column.append(elm[first_row[0]])
    second_column.append(elm[first_row[1]])
    third_column.append(elm[first_row[2]])

dict1 = {}
dict1[first_row[0]] = first_column
dict1[first_row[1]] = second_column
dict1[first_row[2]] = third_column

res = pd.DataFrame(dict1, columns=first_row[:3])
print(res)
