Python pandas: how to vectorize this function

I have two DataFrames df and evol as follows (simplified for the example):
In [6]: df
Out[6]:
   data  year_final  year_init
0    12        2023       2012
1    34        2034       2015
2     9        2019       2013
...

In [7]: evol
Out[7]:
      evolution
year
2000   1.474946
2001   1.473874
2002   1.079157
...
2037   1.463840
2038   1.980807
2039   1.726468
I would like to perform the following operation in a vectorized way (the current for-loop implementation is just too slow when I have gigabytes of data):
for index, row in df.iterrows():
    for year in range(row['year_init'], row['year_final']):
        factor = evol.at[year, 'evolution']
        df.at[index, 'data'] += df.at[index, 'data'] * factor
The complexity comes from the fact that the range of years is not the same on each row...
In the above example the output would be:
        data  year_final  year_init
0     163673        2023       2012
1  594596046        2034       2015
2       1277        2019       2013
(Full evol DataFrame for testing purposes:)
evolution
year
2000 1.474946
2001 1.473874
2002 1.079157
2003 1.876762
2004 1.541348
2005 1.581923
2006 1.869508
2007 1.289033
2008 1.924791
2009 1.527834
2010 1.762448
2011 1.554491
2012 1.927348
2013 1.058588
2014 1.729124
2015 1.025824
2016 1.117728
2017 1.261009
2018 1.705705
2019 1.178354
2020 1.158688
2021 1.904780
2022 1.332230
2023 1.807508
2024 1.779713
2025 1.558423
2026 1.234135
2027 1.574954
2028 1.170016
2029 1.767164
2030 1.995633
2031 1.222417
2032 1.165851
2033 1.136498
2034 1.745103
2035 1.018893
2036 1.813705
2037 1.463840
2038 1.980807
2039 1.726468

One vectorization approach using only pandas is to do a cartesian join between the two frames and then subset. It would start out like:
df['dummy'] = 1
evol['dummy'] = 1
combined = df.merge(evol, on='dummy')
# filter date ranges, multiply etc
This will likely be faster than what you are doing, but is memory inefficient and might blow up on your real data.
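A hedged sketch of how the remaining "filter date ranges, multiply etc" step could look (not part of the original answer, and untested at scale). Note that evol's year index has to become a column for the filtering, and that each year of the loop multiplies data by (1 + factor), so the net effect per row is the product of (1 + evolution) over [year_init, year_final):
# sketch only: 'index' identifies the original df row after reset_index()
df2 = df.reset_index().assign(dummy=1)
ev2 = evol.reset_index().assign(dummy=1)   # 'year' becomes a regular column
combined = df2.merge(ev2, on='dummy')      # cartesian product of rows and years
in_range = combined[(combined['year'] >= combined['year_init']) &
                    (combined['year'] < combined['year_final'])]
growth = (1 + in_range['evolution']).groupby(in_range['index']).prod()
df['data'] = df['data'] * growth.reindex(df.index, fill_value=1)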
If you can take on the numba dependency, something like this should be very fast - essentially a compiled version of what you are doing now. Something similar would be possible in Cython as well. Note that this requires that the evol DataFrame is sorted and contiguous by year; that could be relaxed with some modification.
import numba
import numpy as np

@numba.njit
def f(data, year_final, year_init, evol_year, evol_factor):
    data = data.copy()
    for i in range(len(data)):
        year_pos = np.searchsorted(evol_year, year_init[i])
        n_years = year_final[i] - year_init[i]
        for offset in range(n_years):
            data[i] += data[i] * evol_factor[year_pos + offset]
    return data

f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
Out[24]: array([ 163673, 594596044, 1277], dtype=int64)
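The returned array can be assigned straight back onto the frame if needed, e.g.:
df['data'] = f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)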
Edit:
Some timings with your test data
In [25]: %timeit f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
15.6 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [26]: %%time
    ...: for index, row in df.iterrows():
    ...:     for year in range(row['year_init'], row['year_final']):
    ...:         factor = evol.at[year, 'evolution']
    ...:         df.at[index, 'data'] += df.at[index, 'data'] * factor
Wall time: 3 ms

Related

For loop to exclude percentiles in pandas dataframe based on year and region

I have a dataset consisting of days in which the average temperature is above 10 degrees Celsius for some states in the US.
I want to define the different percentiles (from the 1st to the 99th percentile, in 5-percentile increments) for each year and state, and remove the rows that are larger than the percentile from each data frame.
x = pd.read_csv('C:/data/data_1.csv')
percentile1 = x.groupby(['STATE', 'YEAR']).quantile(0.01)
percentile5 = x.groupby(['STATE', 'YEAR']).quantile(0.05)
percentile10 = x.groupby(['STATE', 'YEAR']).quantile(0.1)
percentile15 = x.groupby(['STATE', 'YEAR']).quantile(0.15)
percentile20 = x.groupby(['STATE', 'YEAR']).quantile(0.2)
...
percentile85 = x.groupby(['STATE', 'YEAR']).quantile(0.85)
percentile90 = x.groupby(['STATE', 'YEAR']).quantile(0.90)
percentile99 = x.groupby(['STATE', 'YEAR']).quantile(0.99)
print(percentile1)
                 ID     doy
STATE YEAR
AK    2001  1193.40  190.56
      2002  1903.48  138.24
      2003  2104.40  143.66
      2004  1946.40  132.00
      2005  2221.08  121.24
...             ...     ...
WY    2015   156.79   78.70
      2016   114.60   83.68
      2017   102.60  111.10
      2018   115.04  114.51
      2019   115.01  114.02
####Calculate the annual ignition timing quantiles per ecoregion
AK01 = x[(x["STATE"] == 'AK') & (x["YEAR"] == 2001)]
AK01 = AK01[AK01["doy"] >= percentile1.doy[0]]
So far I have done it like this, but it would take forever to do it like this per state, per year.
I would love to loop over this in a way so that it subsets per STATE and per YEAR.
Something like:
if x.STATE == percentile1.index[0] and x.YEAR == percentile1.index[1]:
    x[x["doy"] >= percentile1.doy[0]]
I would eventually end up with something like a data frame with
print(df_percentile1)
STATE  YEAR   ID  DOY
AK     2001    1  191
AK     2001    2  200
AK     2001    3  200
...     ...  ...  ...
AK     2019   17  185
WY     2019  209   99
WY     2019  210  100
How should I incorporate all this in a for-loop?
Edit:
I think that I basically want to do this after I have reset_index() for all the percentiles:
percentile1 = percentile1.reset_index()
x = np.where((x['STATE'] == percentile1['STATE']) & (x['YEAR'] == percentile1['YEAR']) & (x['doy'] <= percentile1['doy']),
             'TRUE', 'FALSE')
but I get the following error message
ValueError: Can only compare identically-labeled Series objects
However, everything is labelled identically. How should I deal with this?
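The error occurs because x and percentile1 have different lengths and indexes, so pandas refuses the element-wise comparison. One possible way around it (a sketch, not from the original thread) is to broadcast the per-(STATE, YEAR) threshold back onto x with a merge, so both sides of the comparison live in the same frame. Assuming percentile1 is the groupby/quantile result shown above:
# hypothetical sketch: attach each row's 1st-percentile doy, then filter on it
thresholds = percentile1.reset_index()[['STATE', 'YEAR', 'doy']].rename(columns={'doy': 'doy_p1'})
merged = x.merge(thresholds, on=['STATE', 'YEAR'], how='left')
df_percentile1 = merged[merged['doy'] >= merged['doy_p1']].drop(columns='doy_p1')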

data cleansing - 2 columns change the data in one column if criteria is met

I have columns with vehicle data; for vehicles more than 1 year old with mileage less than 100, I want to replace the mileage with 1000.
My attempts:
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)], 1000
Error - AttributeError: 'tuple' object has no attribute
and
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)]
mileage_corr['mileage'].where(mileage_corr['mileage'] <= 100, 1000, inplace=True)
error -
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return self._where(
Without complete information, assuming your vehicle_data_all DataFrame looks something like this,
   year  mileage
0  2019      192
1  2014       78
2  2010       38
3  2018      119
4  2019        4
5  2012      122
6  2005       50
7  2015       69
8  2004       56
9  2003      194
Pandas has a way of assigning based on a filter result. This is referred to as setting values.
df.loc[condition, "field_to_change"] = desired_change
Applied to your DataFrame, it would look something like this:
vehicle_data_all.loc[((vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)), "mileage"] = 1000
This was my result:
   year  mileage
0  2019      192
1  2014     1000
2  2010     1000
3  2018      119
4  2019     1000
5  2012      122
6  2005     1000
7  2015     1000
8  2004     1000
9  2003      194
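For reference, a self-contained version of the snippet above (the sample frame is typed in from the table shown; the 'year' and 'mileage' column names are assumed):
import pandas as pd

# rebuild the sample frame shown above
vehicle_data_all = pd.DataFrame({
    'year':    [2019, 2014, 2010, 2018, 2019, 2012, 2005, 2015, 2004, 2003],
    'mileage': [192, 78, 38, 119, 4, 122, 50, 69, 56, 194],
})

# set mileage to 1000 wherever both conditions hold
vehicle_data_all.loc[
    (vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020),
    "mileage",
] = 1000

print(vehicle_data_all)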

Pandas percentage change using group by

Suppose I have the following DataFrame:
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
                   'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})
And I want to find, for each city and year, what was the percentage change of value compared to the year before. My final dataframe would be:
city  year  value
   a  2013    NaN
   a  2014   0.20
   a  2016    NaN
   b  2015    NaN
   b  2016   0.05
   c  2013    NaN
   d  2016    NaN
   d  2017  -0.14
   d  2018   0.23
I tried to group by city and then use apply, but it didn't work:
df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()
It didn't work because I couldn't get the year, and also because this way I was considering that I had all years for all cities, but that is not true.
EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
Let's try a lazy groupby(): use pct_change for the changes and diff to detect the year jump:
groups = df.sort_values('year').groupby(['city'])
df['pct_chg'] = (groups['value'].pct_change()
                   .where(groups['year'].diff() == 1)
                )
Output:
  city  year  value   pct_chg
0    a  2013     10       NaN
1    a  2014     12  0.200000
2    a  2016     16       NaN
3    b  2015     20       NaN
4    b  2016     21  0.050000
5    c  2013     11       NaN
6    d  2016     15       NaN
7    d  2017     13 -0.133333
8    d  2018     16  0.230769
Although @Quang's answer is much more elegantly written and concise, I will just add another approach using indexing.
sorted_df = df.sort_values(by=['city', 'year'])
sorted_df.loc[((sorted_df.year.diff() == 1) &
               (sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()
My approach is faster, as you can see below when run on your df, but the syntax is not as pretty.
%timeit  # mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit  # @Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Faster way to filter pandas DataFrame in For loop on multiple conditions

I am working with a large dataframe (~10M rows) that contains dates and textual data, and I have a list of values for which I need to make some calculations, one value at a time.
For each value, I need to filter/subset my dataframe based on 4 conditions, then make my calculations and move on to the next value.
Currently, ~80% of the time is spent on the filtering block, making the processing time extremely long (a few hours).
What I currently have is this:
for val in unique_list:  # iterate on values in a list
    if val is not None or val != kip:  # as long as it's an acceptable value
        for year_num in range(1, 6):  # split by years
            # filter and make intermediate df based on per value & per year calculation
            cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
            cond_2 = df[f'{kip}'].notna()
            cond_3 = df['Date'].dt.year < 2015 + year_num
            cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
            temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()
Condition 1 takes around 45% of the time, while conditions 3 & 4 take 22% each.
Is there a better way to implement this? Is there a way to remove .dt and .str and use something faster?
The profiling output on 3 values (out of thousands):
Total time: 16.338 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_word_counts(df, kip, unique_list):
     2                                               # to hold predictors
     3         1       1929.0   1929.0      0.0      predictors_df = pd.DataFrame(index=[f'{kip}'])
     4         1          2.0      2.0      0.0      n = 0
     5
     6         3          7.0      2.3      0.0      for val in unique_list:  # iterate on values in a list
     7         3          3.0      1.0      0.0          if val is not None or val != kip:  # as long as it's an acceptable value
     8        18         39.0      2.2      0.0              for year_num in range(1, 6):  # split by years
     9
    10                                                           # filter and make intermediate df based on per value & per year calculation
    11        15    7358029.0 490535.3     45.0                  cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
    12        15     992250.0  66150.0      6.1                  cond_2 = df[f'{kip}'].notna()
    13        15    3723789.0 248252.6     22.8                  cond_3 = df['Date'].dt.year < 2015 + year_num
    14        15    3733879.0 248925.3     22.9                  cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
The data mainly looks like this (I use only relevant columns when doing the calculations):
Date Ingredient
20 2016-07-20 Magnesium
21 2020-02-18 <NA>
22 2016-01-28 Apple;Cherry;Lemon;Olives General;Peanut Butter
23 2015-07-23 <NA>
24 2018-01-11 <NA>
25 2019-05-30 Egg;Soy;Unspecified Egg;Whole Eggs
26 2020-02-20 Chocolate;Peanut;Peanut Butter
27 2016-01-21 Raisin
28 2020-05-11 <NA>
29 2020-05-15 Chocolate
30 2019-08-16 <NA>
31 2020-03-28 Chocolate
32 2015-11-04 <NA>
33 2016-08-21 <NA>
34 2015-08-25 Almond;Coconut
35 2016-12-18 Almond
36 2016-01-18 <NA>
37 2015-11-18 Peanut;Peanut Butter
38 2019-06-04 <NA>
39 2016-04-08 <NA>
So, it looks like you really just want to split by year of the 'Date' column and do something with each group. Also, for a large df, it is usually faster to filter what you can once beforehand and get a smaller frame (in your example, one year's worth of data per group), then do all your looping/extractions on the smaller df.
Without knowing much more about the data itself (C-contiguous? F-contiguous? Date-sorted?), it's hard to be sure, but I would guess that the following may prove to be faster (and it also feels more natural IMHO):
# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?

# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')

# 2. do one groupby by year
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
Example (guessing some data, really):
import re
from collections import defaultdict

import numpy as np
import pandas as pd

n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k*pd.Timedelta('1 day') for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})
unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]

%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s

%%time
out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s
Quick sniff test:
>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
Date x foo bar
3354886 2015-12-31 409 roger hello
8792739 2015-12-31 474 roger zulu
3944171 2015-12-31 310 roger hi
7578485 2015-12-31 125 roger None
2963220 2015-12-31 809 roger hi

Pandas dataframe slicing

I have the following dataframe:
    2012   2013   2014   2015   2016   2017   2018                Kategorie
0   5.31   5.27   5.61   4.34   4.54   5.02   7.07  Gewinn pro Aktie in EUR
1  13.39  14.70  12.45  16.29  15.67  14.17  10.08                      KGV
2 -21.21  -0.75   6.45 -22.63  -7.75   9.76  47.52           Gewinnwachstum
3 -17.78   2.27  -0.55   3.39   1.48   0.34    NaN                      PEG
Now, I am selecting only the KGV row with:
df[df["Kategorie"] == "KGV"]
Which outputs:
2012 2013 2014 2015 2016 2017 2018 Kategorie
1 13.39 14.7 12.45 16.29 15.67 14.17 10.08 KGV
How do I calculate the mean() of the last five years (2016,15,14,13,12 in this example)?
I tried
df[df["Kategorie"] == "KGV"]["2016":"2012"].mean()
but this throws a TypeError. Why can I not slice the columns here?
loc supports that type of slicing (from left to right):
df.loc[df["Kategorie"] == "KGV", "2012":"2016"].mean(axis=1)
Out:
1 14.5
dtype: float64
Note that this does not necessarily mean 2012, 2013, 2014, 2015 and 2016. These are strings so it means all columns between df['2012'] and df['2016']. There could be a column named foo in between and it would be selected.
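For example (a made-up frame, purely to illustrate that caveat):
import pandas as pd

# label-based slicing takes everything between the two labels, including an
# unrelated column that happens to sit in between
tmp = pd.DataFrame([[1, 2, 99, 3]], columns=['2012', '2013', 'foo', '2016'])
print(tmp.loc[:, '2012':'2016'])
#    2012  2013  foo  2016
# 0     1     2   99     3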
I used filter and iloc:
row = df[df.Kategorie == 'KGV']
row.filter(regex=r'\d{4}').sort_index(axis=1).iloc[:, -5:].mean(axis=1)
1    13.732
dtype: float64
Not sure why the last five years are 2012-2016 (they seem to be the first five years). Notwithstanding, to find the mean for 2012-2016 for 'KGV', you can use
df[df['Kategorie'] == 'KGV'][[c for c in df.columns if c != 'Kategorie' and 2012 <= int(c) <= 2016]].mean(axis=1)
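For the frame shown, this evaluates to the same value as the loc-based answer above (an arithmetic check, not from the original post):
(13.39 + 14.70 + 12.45 + 16.29 + 15.67) / 5
# 14.5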
