Pandas - apply rolling to columns speed - python

I have a dataframe where I take the subset of only numeric columns, calculate the 5 day rolling average for each numeric column and add it as a new column to the df.
This approach works but currently takes quite a long time (8 seconds per column). I'm wondering if there is a better way to do this.
A working toy example of what I'm doing currently:
import pandas as pd

data = {'Group': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Year': ['2017', '2017', '2017', '2018', '2018', '2018', '2017', '2017', '2018', '2018', '2017', '2017', '2017', '2017', '2018', '2018'],
        'Score 1': [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
        'Score 2': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)

for col in ['Score 1', 'Score 2']:
    df[col + '_avg'] = df.groupby(['Year', 'Group'])[col].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())

For anyone who lands on this, I was able to speed this up significantly by sorting first and avoiding the lambda function:
return_df[col + '_avg'] = df.sort_values(['Group', 'Year']).groupby(['Group'])[col].rolling(2,1).mean().shift().values
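Note that the trailing .shift() in the one-liner above runs on the flattened grouped result, so the last rolling mean of one group bleeds into the first row of the next. A minimal sketch of the same speed-up with a per-group shift, using the toy df above:

out = df.sort_values(['Group', 'Year'])
for col in ['Score 1', 'Score 2']:
    # rolling mean per group; level 0 of the result's MultiIndex is the group key
    rolled = out.groupby('Group')[col].rolling(2, 1).mean()
    # shift within each group so values never leak across group boundaries
    out[col + '_avg'] = rolled.groupby(level=0).shift().values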

Sum values in one dataframe based on date range in a second dataframe

I have two dataframes (simplified examples below). One contains a series of dates and values (df1); the second contains date ranges (df2). I would like to select the rows of df1 that fall within each date range from df2, sum the associated df1 values, and add them to a new column in df2.
I'm a novice, and everything I have tried has been unsuccessful: a combination of wrong methods, incompatible methods, syntax errors, and so on. I have searched the Q&As here, but none quite address this issue.
import pandas as pd

#********** df1: dates and values ***********
rng = pd.date_range('2012-02-24', periods=12, freq='D')
df1 = pd.DataFrame({
    'STATCON': ['C00028', 'C00489', 'C00038', 'C00589', 'C10028', 'C00499',
                'C00238', 'C00729', 'C10044', 'C00299', 'C00288', 'C00771'],
    'Date': rng,
    'Val': [0.96, 0.57, 0.39, 0.17, 0.93, 0.86, 0.54, 0.58, 0.43, 0.19, 0.40, 0.32]
})

#********** df2: date ranges ***********
df2 = pd.DataFrame({
    'BCON': ['B002', 'B004', 'B005'],
    'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
    'End': ['2012-02-29', '2012-03-04', '2012-03-06']
})
df2[['Start', 'End']] = df2[['Start', 'End']].apply(pd.to_datetime)

#********** Desired output: df2 with summed values ***********
df3 = pd.DataFrame({
    'BCON': ['B002', 'B004', 'B005'],
    'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
    'End': ['2012-02-29', '2012-03-04', '2012-03-06'],
    'Sum_Val': [2.92, 3.53, 2.46]
})
You can solve this with the DataFrame.apply function as follows:

def to_value(row):
    # sum the df1 values whose Date falls inside this row's [Start, End] range
    return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End'])]['Val'].sum()

df3 = df2.copy()
df3['Sum_Val'] = df3.apply(to_value, axis=1)
The to_value function is called on every row of the df3 dataframe.
See here for a live implementation of the solution: https://1000words-hq.com/n/TcYN1Fz6Izp
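Running this on the sample frames reproduces the desired output from the question:

print(df3)
#    BCON      Start        End  Sum_Val
# 0  B002 2012-02-25 2012-02-29     2.92
# 1  B004 2012-02-28 2012-03-04     3.53
# 2  B005 2012-03-01 2012-03-06     2.46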
One option is conditional_join from pyjanitor; for non-equi joins like this it avoids building the full cartesian product, which can be memory-hungry depending on the data size:

# pip install pyjanitor
import pandas as pd
import janitor  # registers conditional_join on DataFrame

# ensure the range columns are datetimes (already true if converted as above)
df2 = df2.astype({'Start': 'datetime64[ns]', 'End': 'datetime64[ns]'})

(df1
 .conditional_join(
     df2,
     ('Date', 'Start', '>='),
     ('Date', 'End', '<='))
 .loc[:, ['BCON', 'Start', 'End', 'Val']]
 .groupby(['BCON', 'Start', 'End'], as_index=False)
 .agg(sum_val=('Val', 'sum'))
)
BCON Start End sum_val
0 B002 2012-02-25 2012-02-29 2.92
1 B004 2012-02-28 2012-03-04 3.53
2 B005 2012-03-01 2012-03-06 2.46
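If you would rather stay in plain pandas, a cross join works too (a sketch, assuming pandas >= 1.2 for how='cross'); it materializes len(df1) * len(df2) rows, so it only suits modest frame sizes:

# pair every df1 row with every df2 range, keep the in-range pairs, then sum
merged = df1.merge(df2, how='cross')
in_range = merged['Date'].between(merged['Start'], merged['End'])
out = (merged[in_range]
       .groupby(['BCON', 'Start', 'End'], as_index=False)['Val']
       .sum()
       .rename(columns={'Val': 'Sum_Val'}))
print(out)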

Identify modified rows from updated Dataframe

I collect data and analyze it. Sometimes data collected yesterday or last week is missing a value that gets filled in once records become available at a later date, or a row's value may change. In other words, a row can be modified after the fact; see the sample dataframes.
The first dataframe received:
import pandas as pd
cars = {'Date': ['2020-09-11', '2020-10-11', '2021-01-12', '2020-01-03', '2021-02-01'],
        'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', 'Mercedes'],
        'Price': [22000, 25000, 27000, 35000, 45000],
        'Mileage': [2000, 'NAN', 47000, 3500, 5000]}
df = pd.DataFrame(cars, columns=['Date', 'Brand', 'Price', 'Mileage'])
print(df)
Modifications later made to the first dataframe:
import pandas as pd
cars2 = {'Date': ['2020-09-11', '2020-10-11', '2021-01-12', '2020-01-03', '2021-02-01'],
         'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', 'Mercedes'],
         'Price': [22000, 5000, 27000, 35000, 45000],
         'Mileage': [2000, 100, 47000, 3500, 600]}
df2 = pd.DataFrame(cars2, columns=['Date', 'Brand', 'Price', 'Mileage'])
print(df2)
Now I'd like to know how to select only the rows that were modified relative to the first dataframe. My expected output contains only the rows that changed at the later date. I have tried this, but it gives me the old rows too:

df_diff = pd.concat([df, df2], sort=False).drop_duplicates(keep=False)
Expected output
import pandas as pd
cars3 = {'Date': ['2020-10-11', '2021-02-01'],
         'Brand': ['Toyota Corolla', 'Mercedes'],
         'Price': [5000, 45000],
         'Mileage': [100, 600]}
df3 = pd.DataFrame(cars3, columns=['Date', 'Brand', 'Price', 'Mileage'])
print (df3)
Because both dataframes share the same index and columns, you can use DataFrame.ne to compare for inequality, test whether at least one value per row differs with DataFrame.any, and filter with boolean indexing:
df3 = df2[df.ne(df2).any(axis=1)]
print (df3)
Date Brand Price Mileage
1 2020-10-11 Toyota Corolla 5000 100
4 2021-02-01 Mercedes 45000 600
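If you also want to see the old and new values side by side, DataFrame.compare (available since pandas 1.1) is a useful follow-up; a minimal sketch on the same frames:

# rows and columns where the two frames differ, showing 'self' (old)
# and 'other' (new) values next to each other
diff = df.compare(df2)
print(diff)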

Using dataframe index containing the year as x axis

I'm plotting a visualization with two y axes, each representing a dataframe column. I used the index of one of the dataframes as the x axis (both dataframes have the same index); however, the xtick labels are not showing correctly. I should have the years 2000 to 2018.
I used the following code to create the plot:
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(df1.index, df1, 'g-')
ax2.plot(df1.index, df2, 'b-')
ax1.set_xlabel('X data')
ax1.set_ylabel('Y1 data', color='g')
ax2.set_ylabel('Y2 data', color='b')
plt.show()
the index of df1 is as follows:
Index(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
'2017', '2018'],
dtype='object')
Here's a small snippet of the two dfs:
df1.head()
gdp
2000 1.912873
2001 7.319967
2002 3.121450
2003 5.961162
2004 4.797018
df2.head()
lifeex
2000 68.684
2001 69.193
2002 69.769
2003 70.399
2004 71.067
The plot (image omitted) does not show all of the year labels.
I tried different solutions, including the one in Set Xticks frequency to dataframe index, but none succeeded in getting all the years to show.
I'd really appreciate it if someone could help. Thanks in advance.
When I try ax1.set_xticks(df1.index) I get the following error: '<' not supported between instances of 'numpy.ndarray' and 'str'
I couldn't reproduce your issue (matplotlib version 3.2.2):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'col1': np.random.randint(1, 7, 19)},
                   index=[str(i) for i in range(2000, 2019)])
print(df1.index)
df2 = pd.Series(np.linspace(69,78, 19))
fig, ax1 = plt.subplots(figsize=(15,8))
ax2 = ax1.twinx()
ax1.plot(df1.index, df1, 'g-')
ax2.plot(df1.index, df2, 'b-')
ax1.set_xlabel('X data')
ax1.set_ylabel('Y1 data', color='g')
ax2.set_ylabel('Y2 data', color='b')
plt.show()
Output:
Index(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017',
'2018'],
dtype='object')
The following code solved the problem for me:

years = [int(year) for year in df1.index]
ax1.xaxis.set_ticks(years)
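Along the same lines, a minimal sketch that converts the index to integers before plotting, so matplotlib treats the x axis as numeric from the start (assuming df2 shares the same index as df1):

df1.index = df1.index.astype(int)

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(df1.index, df1['gdp'], 'g-')
ax2.plot(df1.index, df2['lifeex'], 'b-')
ax1.set_xticks(df1.index)  # one tick per year
plt.show()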

Changing a dataframe's values from int type to date

I have the following dataframe
df = pd.DataFrame({
    'date': [1988, 1988, 2000, 2005],
    'value': [2100, 4568, 7896, 68909]
})
I want to make a time series based on this df. How can I change the year from int to a DatetimeIndex so I can plot a time series?
Use pd.to_datetime to convert the years to datetimes and DataFrame.set_index to get the Series, then plot it with Series.plot:
(df.assign(date=pd.to_datetime(df['date'], format='%Y'))
   .set_index('date')['value']
   .plot())
If you want to keep the series, use:
s = (df.assign(date=pd.to_datetime(df['date'], format='%Y'))
       .set_index('date')['value'])
and then:
s.plot()
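For reference, the approach above anchors each year at January 1:

print(s.index)
# DatetimeIndex(['1988-01-01', '1988-01-01', '2000-01-01', '2005-01-01'],
#               dtype='datetime64[ns]', name='date', freq=None)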
df = pd.DataFrame({
    'date': [1988, 1988, 2000, 2005],
    'value': [2100, 4568, 7896, 68909]
})

# build a datetime for January 1 of each year; pd.datetime was removed
# from pandas, so use the standard datetime module instead
import datetime

date = [datetime.datetime(year, 1, 1) for year in df.date]
df.index = date
df['value'].plot()

Convert a Pandas Dataframe column with dictionaries with common keys as elements to a separate data frame using the common

This question is an extension of a question I posted here a while ago. I'm trying to understand the accepted answer provided by @patrickjlong1 (thanks again), so I'm running the code step by step and checking the result.
I found this part hard to fathom:
>>> df_initial
data seriesID
0 {'year': '2017', 'period': 'M12', 'periodName'... SMS42000000000000001
1 {'year': '2017', 'period': 'M11', 'periodName'... SMS42000000000000001
2 {'year': '2017', 'period': 'M10', 'periodName'... SMS42000000000000001
3 {'year': '2017', 'period': 'M09', 'periodName'... SMS42000000000000001
4 {'year': '2017', 'period': 'M08', 'periodName'... SMS42000000000000001
5 {'year': '2017', 'period': 'M07', 'periodName'... SMS42000000000000001
The element in each row of the first column is a dictionary and they all have common keys: 'year', 'period' etc. What I want to convert it to is:
footnotes period periodName value year
0 {} M12 December 6418025 2017
0 {} M11 November 6418195 2017
0 {} M10 October 6418284 2017
...
The solution provided by @patrickjlong1 converts one row's dictionary to a dataframe at a time and appends each to the result, which I understand, since a single dictionary can be converted to a single dataframe:

for i in range(0, len(df_initial)):
    df_row = pd.DataFrame(df_initial['data'][i])
    df_row['seriesID'] = series_col
    df = df.append(df_row, ignore_index=True)
My question is: is this the only way to convert the data as I want? If not, what are the other methods?
Thanks
Avoid pd.DataFrame.append in a loop
I can't stress this enough. The pd.DataFrame.append method is expensive as it copies data unnecessarily. Putting this in a loop makes it n times more expensive.
Instead, you can feed the list of dictionaries in the 'data' column straight to the pd.DataFrame constructor, then attach the ID column in one vectorized assignment:

df = pd.DataFrame(df_initial['data'].tolist())
df['seriesID'] = df_initial['seriesID'].values
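An equivalent option (a sketch) is pd.json_normalize, which additionally flattens nested dictionaries if the records contain them:

df = pd.json_normalize(df_initial['data'].tolist())
df['seriesID'] = df_initial['seriesID'].values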
