I am trying to add a new column "energy_class" to a dataframe "df_energy". It should contain the string "high" if the "consumption_energy" value is > 400, "medium" if the value is between 200 and 400, and "low" if the value is under 200.
I tried np.where from numpy, but numpy.where(condition[, x, y]) handles only two outcomes, not three as in my case.
Any ideas?
Thank you in advance.
Try this:
Using the setup from @Maxu:
col = 'consumption_energy'
conditions = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices = [ "high", 'medium', 'low' ]
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
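A variation on the np.select approach, offered as a sketch rather than the answerer's exact code: np.select returns the choice belonging to the first condition that is true, so later conditions do not need to repeat the earlier bounds. The thresholds follow the question (above 400 is high, 200 to 400 is medium). The string default "unknown" is an assumption of this sketch; it avoids mixing string choices with a float default such as np.nan, which some NumPy versions may reject or turn into the text 'nan'.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"consumption_energy": [459, 416, 186, 250, 411, 210, 343, 328, 208, 223]}
)
col = "consumption_energy"

# Conditions are checked in order; the first one that is True wins,
# so the second condition only sees values that are not above 400.
conditions = [df2[col] > 400, df2[col] >= 200, df2[col] < 200]
choices = ["high", "medium", "low"]

# "unknown" (hypothetical label) is used for rows where no condition holds,
# for example missing values.
df2["energy_class"] = np.select(conditions, choices, default="unknown")
print(df2)
```

Rows with a missing consumption value would end up as "unknown" here and can be converted to NaN afterwards if needed.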
You can use a ternary:
np.where(consumption_energy > 400, 'high',
(np.where(consumption_energy < 200, 'low', 'medium')))
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
    if x > 400: return "High"
    elif x > 200: return "Medium"
    else: return "Low"
func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add the resulting NumPy array as a column in your dataframe using:
df_energy["energy_class"] = energy_class
The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily.
Hope it helps.
I would use the cut() method here, which produces a very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
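To make the bin edges explicit, here is a small sketch (the sample values are made up for illustration). By default pd.cut uses right-closed intervals, so a value of exactly 200 lands in 'low' and exactly 400 lands in 'medium'. Using -np.inf as the first edge is an assumption of this sketch; it keeps a value of 0 inside the first bin, whereas with a first edge of 0 such a value would fall outside every bin and become NaN.

```python
import numpy as np
import pandas as pd

# Sample values chosen to sit on and around the bin edges
df = pd.DataFrame({"consumption_energy": [0, 186, 200, 250, 400, 459]})

# Intervals are (-inf, 200], (200, 400], (400, inf]
df["energy_class"] = pd.cut(
    df["consumption_energy"],
    bins=[-np.inf, 200, 400, np.inf],
    labels=["low", "medium", "high"],
)
print(df)
```

If the boundaries should be left-closed instead, pd.cut accepts right=False.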
WARNING: Be careful with NaNs
If your data has missing values, np.where can be tricky to use and may silently give you the wrong result.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can add one more nested np.where to separate medium from NaN, which would be ugly.
IMHO the best way to go is pd.cut. It deals with NaNs and is easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age < 20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat2'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
                 (np.where(df.age < 20, 'child',
                  np.where(df.age.isnull(), np.nan, 'medium'))))
# outputs
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN NaN nan
6 54.0 medium medium medium
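As a sketch of one way to keep missing values missing without nesting another np.where: label everything with np.select, then blank out the rows whose input was NaN using Series.where. The sample ages and the placeholder label "missing" are assumptions made for this illustration; they are not taken from the titanic data.

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 70.0, 5.0])

# Comparisons against NaN are False, so NaN rows fall through to the default.
labels = np.select(
    [age > 60, age < 20, age.notna()],
    ["old", "child", "medium"],
    default="missing",  # hypothetical placeholder, replaced by NaN below
)

# Keep the label where age is present, otherwise leave a real missing value.
age_cat = pd.Series(labels, index=age.index).where(age.notna())
print(age_cat)
```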
Let's start by creating a dataframe with 1,000,000 random integers between 0 and 999 to be used as a test:
df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A bit of a description of the dataframe:
print(df_energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
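# include_lowest=True keeps a value of exactly 0 in the 'low' bin instead of producing NaN
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'], include_lowest=True)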
Using a plain Python function with Series.apply
def energy_class(x):
    if x > 400:
        return 'high'
    elif x > 200:
        return 'medium'
    else:
        return 'low'
df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, by measuring time with time.perf_counter() (for other ways to measure time of execution see this), pandas.cut was the fastest approach.
method time
0 np.where() 0.124139
1 np.select() 0.155879
2 numpy.vectorize() 0.452789
3 pandas.cut() 0.046143
4 Python's built-in functions 0.138021
5 lambda function 0.19081
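For anyone who wants to repeat this kind of measurement, here is a minimal timing sketch using time.perf_counter(). It is not the script that produced the table above: the helper name time_it, the smaller data size and the choice of taking the best of several runs are assumptions of this sketch, and absolute numbers will differ by machine and library version.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_energy = pd.DataFrame({"consumption_energy": rng.integers(0, 1000, 100_000)})


def time_it(func, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def with_where():
    s = df_energy["consumption_energy"]
    return np.where(s > 400, "high", np.where(s > 200, "medium", "low"))


def with_cut():
    return pd.cut(
        df_energy["consumption_energy"],
        bins=[-np.inf, 200, 400, np.inf],
        labels=["low", "medium", "high"],
    )


timings = {"np.where": time_it(with_where), "pd.cut": time_it(with_cut)}
for method, seconds in timings.items():
    print(f"{method}: {seconds:.6f} s")
```

Both functions produce the same labels, so the comparison is like for like; only the result type differs (a string array versus a categorical Series).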
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
Try this. Rows where consumption_energy is null are skipped by the notnull() mask, so they simply end up as NaN in energy_class.
def egy_class(x):
    '''
    This function assigns classes as per the energy consumed.
    '''
    return ('high' if x>400 else
            'low' if x<200 else 'medium')
chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
I second using np.vectorize. It keeps the code clean, although note that it is essentially a Python-level loop, so do not expect it to be faster than np.where (the timing comparison above measured it as the slowest option). You can use a dictionary format for your conditionals as well as the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1':'high',
           'Condition2':'medium',
           'Condition3':'low',
           'Condition4':'lowest'}
def Conditions(dfSeries_element, dictionary):
    '''
    dfSeries_element is an element from df_series
    dictionary: is the dictionary of your conditions with their outcome
    '''
    if dfSeries_element in dictionary.keys():
        return dictionary[dfSeries_element]
    # elements without an entry in the dictionary fall through and return None
def VectorizeConditions():
    # excluded passes the dictionary through untouched; otypes=[object] keeps None as None
    func = np.vectorize(Conditions, excluded=['dictionary'], otypes=[object])
    result_vector = func(df['Series'], dictionary=row_dic)
    df['new_Series'] = result_vector
# running the below function will apply multi conditional formatting to your df
VectorizeConditions()
myassign["assign3"] = np.where(myassign["points"] > 90, "genius", np.where((myassign["points"] > 50) & (myassign["points"] <= 90), "good", "bad"))
This works when you want to use only np.where but with multiple conditions. You can add more conditions by nesting further np.where calls in the same way; the two values of the innermost call are the last two outcomes.
I very often run into situations where I want to use columns in a pandas dataframe to map from one set of values to another and I'm never sure if I'm using pandas the way I'm supposed to. Given these inputs
import pandas as pd
df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
to_map = pd.Series(['y', 'x', 'z', 'x', 'alpha'], index=[91, 90, 92, 90, 93])
and assuming df is so large that operations like df.swap_index could raise a memory error if they copy the whole dataframe, what is the recommended way to perform the following four mappings in pandas? Additionally, if the recommended approach is not the most memory efficient then what is the most efficient way? If a dict comprehension or other python built-in is more efficient that's excellent, but I don't want solutions that require additional imports.
1. to_map values from df column to df column
desired_output = to_map.map(pd.Series(df['A'].array, index=df['B']),
na_action='ignore')
print(desired_output)
91 19.0
90 17.0
92 23.0
90 17.0
93 NaN
dtype: float64
2. to_map values from df column to df.index
desired_output = to_map.map(pd.Series(df.index, index=df['B']),
na_action='ignore')
print(desired_output)
91 1.0
90 0.0
92 2.0
90 0.0
93 NaN
dtype: float64
3. to_map.index from df column to df column
desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df['A'].array, index=df['C']),
na_action='ignore')
print(desired_output)
0 NaN
1 17.0
2 19.0
3 17.0
4 NaN
dtype: float64
4. to_map.index from df column to df.index
desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df.index, index=df['C']),
na_action='ignore')
print(desired_output)
0 NaN
1 0.0
2 1.0
3 0.0
4 NaN
dtype: float64
OK, I got more interested in this so I put together a script to get maximum memory consumption for the approaches in the original post (called series.map below), @mitoRibo's answer (called reindex below), and @sammywemmy's comments (called merge below). It just makes the data in the OP 1e5 times longer and runs the operations a few times to get average values.
TL;DR
- If you can get everything into an integer type then series.map always consumes the least memory, sometimes by a lot.
- Otherwise, if you can't work with integers, for the cases I set up:
  - reindex has the lowest peak memory consumption if you have a short dataframe and a long series of values to map.
  - series.map has the lowest peak memory consumption if you're working with a long dataframe, except when taking the index of a series of values from one dataframe column to another dataframe column. Then reindex is cheaper.
  - merge never has the lowest peak memory consumption.
- merge is your only choice if you have repeated values in the domain of the map (that would be df.B in cases 1 and 2, df.C in cases 3 and 4, or df.A in cases 5 and 6). Note this is the only choice out of the three discussed here; I haven't thought about other options.
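The point about repeated values in the domain can be illustrated with a small sketch. The data below is a made-up variant of the question's inputs in which 'x' appears twice in column B; the variable names are only for illustration. With a duplicated lookup key, reindex refuses to run, while merge returns one row per match.

```python
import pandas as pd

# 'x' appears twice in the lookup column B
df = pd.DataFrame({"A": [17, 19, 23], "B": ["x", "x", "z"]})
to_map = pd.Series(["x", "z", "alpha"], name="B")

# merge keeps every match: two rows for 'x', one for 'z', NaN for 'alpha'
merged = df.merge(to_map, how="right")["A"]
print(merged)

# reindex needs unique labels in the lookup index and raises otherwise
try:
    df.set_index("B")["A"].reindex(to_map)
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print("reindex failed:", exc)
```

Note that the merged result has more rows than to_map when a key matches several times, so it no longer lines up one-to-one with the values being mapped.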
Details
Here are the results (code at the end)
peak memory consumption in mb with a long series of values to map
case reindex series.map merge description
0 1 29.018134 29.013939 66.028864 to_map['str'], df.B -> df.A
1 2 29.027581 29.015175 70.035325 to_map['str'], df.B -> df.index
2 3 8.927531 29.516091 49.035328 to_map.index, df.C -> df.A
3 4 8.928258 29.516747 53.039901 to_map.index, df.C -> df.index
4 5 8.428121 8.018872 49.035975 to_map['int'], df.A -> df.B
5 6 8.928532 8.518677 53.039986 to_map['int'], df.A -> df.index
peak memory consumption in mb with a long dataframe for mapping
case reindex series.map merge description
0 1 24.614136 17.412829 19.867535 to_map['str'], df.B -> df.A
1 2 36.859664 17.413827 29.472225 to_map['str'], df.B -> df.index
2 3 13.510243 19.324671 19.870097 to_map.index, df.C -> df.A
3 4 33.859205 21.725148 29.473337 to_map.index, df.C -> df.index
4 5 15.910685 8.470053 19.870748 to_map['int'], df.A -> df.B
5 6 33.859534 10.869827 29.473924 to_map['int'], df.A -> df.index
peak memory consumption in mb with a long series and a long dataframe
case reindex series.map merge description
0 1 36.213309 29.013665 66.023693 to_map['str'], df.B -> df.A
1 2 38.615469 31.414951 79.629001 to_map['str'], df.B -> df.index
2 3 21.769360 29.513805 60.907156 to_map.index, df.C -> df.A
3 4 33.618402 29.725443 70.510802 to_map.index, df.C -> df.index
4 5 23.669874 16.470405 52.024282 to_map['int'], df.A -> df.B
5 6 33.618597 19.370167 61.627128 to_map['int'], df.A -> df.index
size of operands in mb
short_df 0.000318
short_to_map['str'] 0.000254
short_to_map['int'] 0.00014
long_df 34.499973
long_to_map['str'] 25.4
long_to_map['int'] 14.0
Making the operands 10 times bigger (setting num_copies to 1e6 in the code below) generally makes all the values above 10x larger within about 5%, although when there is some variance merge tends to use about 5% more than 10x the memory listed above and the other two tend to use about 5% less. The exception is using series.map with integer values (cases 5 and 6), which uses 20% less than 10x the above value for short series and long dataframes.
I used the following script in a Jupyter Notebook on a Windows 11 machine with an Intel Core i7 processor and 16 GB of memory. The code requires Python 3.9 or above (tracemalloc.reset_peak was added in 3.9). I referred to this SO post for memory profiling, and I copied randomword from this post.
import random
import string
import sys
import tracemalloc
import pandas as pd
def grab_traced_memory_and_reset(display_text=None):
    # tracemalloc reports bytes; convert to megabytes
    current, peak = map(lambda x: x / 1e6, tracemalloc.get_traced_memory())
    if display_text is not None:
        print(display_text + '\n')
        print('>>> current mem usage (mb):', current)
        print('>>> peak since reset (mb): ', peak)
        print('reset peak\n')
    tracemalloc.reset_peak()
    return current, peak
def run_cases(cases, name, print_values=False):
    # 0.0 keeps the result column as float from the start
    profile = pd.DataFrame({'case':range(len(cases)), name:0.0})
    baseline = grab_traced_memory_and_reset()[0]
    for i in range(len(cases)):
        if print_values:
            text = cases[i]
        else:
            text = None
        desired_output = eval(cases[i])
        current, peak = grab_traced_memory_and_reset(text)
        profile.loc[i, name] = peak - baseline
        del(desired_output)
    return profile
def average_cases(cases, name, num_runs):
    result = [run_cases(cases, name) for i in range(num_runs)]
    return pd.concat(result).groupby(level=0).mean()
descriptions = ["to_map['str'], df.B -> df.A ",
                "to_map['str'], df.B -> df.index",
                "to_map.index, df.C -> df.A ",
                "to_map.index, df.C -> df.index",
                "to_map['int'], df.A -> df.B ",
                "to_map['int'], df.A -> df.index"]
def report_results(reindex_r, merge_r, map_r):
    results = reindex_r.merge(map_r).merge(merge_r)
    results.loc[:, 'case'] = (results['case'] + 1).astype(int)
    results['description'] = descriptions
    print(results)
def to_map_index_df_col_to_col(to_map):
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    return output.map(pd.Series(df['A'].array, index=df['C']),
                      na_action='ignore')
def to_map_index_df_col_to_index(to_map):
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    # return the mapped series so the caller holds the result, as in the other cases
    return output.map(pd.Series(df.index, index=df['C']),
                      na_action='ignore')
def randomword(length):
    letters = string.ascii_lowercase + string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(length))
# number of copies to make data structures bigger
num_copies = int(1e5)
short_df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
long_df = pd.DataFrame({'A':[17, 19, 23] + list(range(24, num_copies*3+21)),
'B':['x', 'y', 'z'] + [randomword(10) for i in range((num_copies-1)*3)],
'C':[90, 92, 95] + list(range(3*num_copies, 6*num_copies-3))})
short_to_map = pd.DataFrame({'str':['y', 'x', 'z', 'x', 'alpha'],
'int':[19, 17, 23, 17, 43]},
index=[91, 90, 92, 90, 93])
long_to_map = pd.concat([short_to_map]*num_copies).reset_index(drop=True)
map_cases = ["to_map['str'].map(pd.Series(df['A'].array, index=df['B']), na_action='ignore')",
"to_map['str'].map(pd.Series(df.index, index=df['B']), na_action='ignore')",
"to_map_index_df_col_to_col(to_map)",
"to_map_index_df_col_to_index(to_map)",
"to_map['int'].map(pd.Series(df['B'].array, index=df['A']), na_action='ignore')",
"to_map['int'].map(pd.Series(df.index, index=df['A']), na_action='ignore')"]
reindex_cases = ["df.set_index('B')['A'].reindex(to_map['str'])",
"df.reset_index().set_index('B')['index'].reindex(to_map['str'])",
"df.set_index('C')['A'].reindex(to_map.index)",
"df.reset_index().set_index('C')['index'].reindex(to_map.index)",
"df.set_index('A')['B'].reindex(to_map['int'])",
"df.reset_index().set_index('A')['index'].reindex(to_map['int'])"]
merge_cases = ["df.merge(to_map['str'].rename('B'), how='right')['A']",
               "df.reset_index().merge(to_map['str'].rename('B'), how='right')['index']",
               "df.merge(pd.Series(to_map.index).rename('C'), how='right')['A']",
               "df.reset_index().merge(pd.Series(to_map.index).rename('C'), how='right')['index']",
               "df.merge(to_map['int'].rename('A'), how='right')['B']",
               "df.reset_index().merge(to_map['int'].rename('A'), how='right')['index']"]
tracemalloc.start()
# uncomment below to see the results for individual runs
# in a single set of cases
#run_cases(reindex_cases, 'reindex', print_values=True)
#run_cases(merge_cases, 'merge', print_values=True)
#run_cases(map_cases, 'map', print_values=True)
print('peak memory consumption in mb with a long series of values to map')
df = short_df
to_map = long_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('peak memory consumption in mb with a long dataframe for mapping')
df = long_df
to_map = short_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('peak memory consumption in mb with a long series and a long dataframe')
df = long_df
to_map = long_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('size of operands in mb')
print(' short_df ', short_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print("short_to_map['str'] ", short_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print("short_to_map['int'] ", short_to_map['int'].apply(sys.getsizeof).sum() / 1e6)
print(' long_df ', long_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print(" long_to_map['str'] ", long_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print(" long_to_map['int'] ", long_to_map['int'].apply(sys.getsizeof).sum() / 1e6)
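The measurement pattern used by the script can be reduced to a few lines. The sketch below is not part of the original script: the helper name peak_mb and the sample data are assumptions for illustration. It records the traced memory before a call, runs the call, and reports how far the peak rose above that baseline. It needs Python 3.9+ because of tracemalloc.reset_peak().

```python
import tracemalloc

import pandas as pd


def peak_mb(func):
    """Run func() and return its peak traced memory above the baseline, in MB."""
    tracemalloc.start()
    tracemalloc.reset_peak()
    baseline, _ = tracemalloc.get_traced_memory()
    result = func()  # keep a reference so the output counts towards the peak
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    return (peak - baseline) / 1e6


# A long series of values to map and a short lookup series
to_map = pd.Series(["y", "x", "z", "x", "alpha"] * 20_000)
lookup = pd.Series([17, 19, 23], index=["x", "y", "z"])

usage = peak_mb(lambda: to_map.map(lookup))
print(f"peak memory above baseline: {usage:.3f} MB")
```

tracemalloc only sees allocations made through Python's memory allocators, so the figure is a relative indicator for comparing approaches rather than a measure of total process memory.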
I would adjust the index of the df and then reindex using to_map. I'm not sure that these are all the fastest approaches, but they're all vectorized and use core pandas functions. I also think they are pretty readable, but you can break them into multiple lines. Curious to know if these are slow or fast for your use cases.
#1
print(df.set_index('B')['A'].reindex(to_map))
#2 (maybe slow? not sure)
print(df.reset_index().set_index('B')['index'].reindex(to_map))
#3
print(df.set_index('C')['A'].reindex(to_map.index))
#4
print(df.reset_index().set_index('C')['index'].reindex(to_map.index))
I am trying to calculate the moving average of a very large data set, approximately 30M rows. To illustrate with pandas:
df = pd.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'], 'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())
Here I am using pandas to calculate the moving average. Using above it takes around 20 minutes to calculate on the 30M dataset. Is there a way to leverage DASK here?
You can use Dask.delayed for your calculation. In the example below, a standard python function which contains the pandas moving average command is turned into a dask function using a @delayed decorator.
import pandas as pd
from dask import delayed
@delayed
def mov_average(x):
    x['mov_avg'] = x.groupby("cust_id")["sales"].apply(
        lambda x: x.ewm(alpha=0.5, adjust=False).mean())
    return x
df = pd.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'],
'sales': [100, 200, 300, 400, 500]})
df['mov_avg'] = df.groupby("cust_id")["sales"].apply(
lambda x: x.ewm(alpha=0.5, adjust=False).mean())
df_1 = mov_average(df).compute()
Output
df
Out[22]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
df_1
Out[23]:
cust_id sales mov_avg
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
Alternatively, you could try converting (or reading your file) into a dask data frame. The visualization of the scheduler tasks shows the parallelization of the calculations. So, if your data frame is large enough you might get a reduction in your computation time. You could also try optimizing the number of data frame partitions.
from dask import dataframe
ddf = dataframe.from_pandas(df, npartitions=3)
ddf['emv'] = ddf.groupby('cust_id')['sales'].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean()).compute().sort_index()
ddf.visualize()
ddf.compute()
cust_id sales emv
0 a 100 100.0
1 a 200 150.0
2 a 300 225.0
3 b 400 400.0
4 b 500 450.0
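Independently of Dask, the per-group computation itself can be written with groupby(...).transform(...), shown here as a pandas-only sketch. transform returns a result aligned to the original row index, which avoids index-alignment surprises when assigning the output of groupby(...).apply(...) back to a column (depending on the pandas version, apply may return a result indexed by group key as well). Whether this is faster on the 30M-row data is not something this sketch claims.

```python
import pandas as pd

df = pd.DataFrame({"cust_id": ["a", "a", "a", "b", "b"],
                   "sales": [100, 200, 300, 400, 500]})

# transform keeps the original index, so the result can be assigned directly
df["mov_avg"] = df.groupby("cust_id")["sales"].transform(
    lambda s: s.ewm(alpha=0.5, adjust=False).mean()
)
print(df)
```

With adjust=False each value is 0.5 times the previous average plus 0.5 times the current sale, restarted for every customer, which reproduces the mov_avg column shown above.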
I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can be overlapping (the count should be added to both) and that some counts may not fall in the master (the count should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide whether each individual record is in the band or not, and finally groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'],how='left')
# logical overwrite of count
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0 , axis=1 )
df_agged = df_merged[['Begin','End','Marker','Count']].groupby(['Begin','End','Marker']).sum()
df_agged_resorted = df_agged.sort_index(level = ['Marker','Begin','End'])
df_agged_resorted = df_agged_resorted.astype(int)  # np.int was removed from NumPy; the builtin int works
df_agged_resorted.columns =['boston'] # rename the count column to boston.
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
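The same join-then-filter idea can be sketched without the row-wise apply, using Series.between for the band test. This is an alternative formulation, not the answer's code; the intermediate names merged, in_bin and counts are made up for the sketch. Keeping the master's row index through the merge also lets the result be attached straight back to df_inventory_master.

```python
import pandas as pd

df_inventory_master = pd.DataFrame({"Marker": [1, 1, 1, 2],
                                    "Begin": [100, 300, 500, 100],
                                    "End": [200, 600, 900, 250]})
df_boston = pd.DataFrame({"Marker": [1, 1, 1, 1],
                          "Point": [140, 180, 250, 500],
                          "Count": [14, 600, 1000, 700]})

# reset_index() keeps the master row number in a column named 'index'
merged = df_inventory_master.reset_index().merge(df_boston, on="Marker", how="left")

# True where the point falls inside the bin (both ends inclusive);
# rows with no city data have NaN points and compare as False
in_bin = merged["Point"].between(merged["Begin"], merged["End"])

# zero out counts outside the bin, then sum per master row
counts = merged["Count"].where(in_bin, 0).groupby(merged["index"]).sum()

result = df_inventory_master.assign(boston=counts.astype(int))
print(result)
```

For several cities the last three steps could be repeated per city frame, adding one column each time.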
I have this code:
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
val x y
0 100 1 11
1 200 2 22
2 300 3 33
3 400 4 44
4 500 5 55
I want to be able to use values of x and y in combination as index into val,
for example
df[3][33]
300
What's the best way to achieve this? I know it must have to do with multi index, but I am not sure exactly how.
You can either define 2 boolean conditions as a mask and use with .loc:
df.loc[(df['x']==3) & (df['y']==33), 'val']
otherwise just set the index and then you can use those values to index into the df:
In [233]:
df = df.set_index(['x','y'])
df.loc[3,33]
Out[233]:
val 300
Name: (3, 33), dtype: int64
You could wrap the first version into a function quite easily.
You can define a function :
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
def multindex(x,y):
    return df.set_index(['x','y']).loc[x,y]
multindex(1,11) # returns a Series holding val 100
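If many lookups are needed, it may be worth setting the index once and reusing it, rather than calling set_index inside the function on every lookup. A minimal sketch, where the names indexed and lookup are made up for the illustration; selecting the 'val' column together with the (x, y) tuple returns the scalar instead of a one-element Series.

```python
import pandas as pd

new_dict = {'x': [1, 2, 3, 4, 5], 'y': [11, 22, 33, 44, 55],
            'val': [100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)

# Build the MultiIndex once and reuse it for every lookup
indexed = df.set_index(['x', 'y'])


def lookup(x, y):
    """Return the scalar val for the (x, y) pair; raises KeyError if absent."""
    return indexed.loc[(x, y), 'val']


print(lookup(3, 33))
```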