A
0 31.353
1 28.945
2 17.377
I want to create a new column df["B"] holding the values of column A rounded up to the next multiple of 5.
The desired output:
A B
0 31.353 35.0
1 28.945 30.0
2 17.377 20.0
I've tried:
def roundup5(x):
    return int(math.ceil(x / 5.0)) * 5
df["B"] = df["A"].apply(roundup5)
I get:
TypeError: unsupported operand type(s) for /: 'str' and 'float'
I think you need to convert the values to floats first, then divide by 5, apply numpy.ceil, and multiply by 5:
df["B"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
df["B"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
Loop version:
def roundup5(x):
    return int(math.ceil(float(x) / 5.0)) * 5.0
df["B"] = df["A"].apply(roundup5)
print (df)
A B
0 31.353 35.0
1 28.945 30.0
2 17.377 20.0
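The TypeError in the question suggests that column A holds strings rather than floats. The following is a minimal sketch under that assumption; the helper name round_up_to_multiple and its base parameter are hypothetical, and pd.to_numeric is used here in place of astype(float) so that unparseable entries become NaN instead of raising:

```python
import numpy as np
import pandas as pd

# Assumption: the values arrived as strings, which is what the TypeError hints at.
df = pd.DataFrame({"A": ["31.353", "28.945", "17.377"]})

def round_up_to_multiple(values, base=5):
    """Round each value up to the next multiple of `base` (hypothetical helper)."""
    numeric = pd.to_numeric(values, errors="coerce")  # strings -> floats, bad values -> NaN
    return np.ceil(numeric / base) * base

df["B"] = round_up_to_multiple(df["A"])
print(df)
```

Passing a different base, for example round_up_to_multiple(df["A"], base=10), rounds up to the next multiple of 10 instead.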
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
[30000 rows x 1 columns]
In [327]: %timeit df["B1"] = df["A"].apply(roundup5)
35.7 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [328]: %timeit df["B2"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
1.25 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [329]: %timeit df["B3"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
1.19 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I want to calculate daily bond returns from clean prices based on the logarithm of the bond price in t divided by the bond price in t-1. So far, I calculate it like this:
import pandas as pd
import numpy as np
#create example data
col1 = np.random.randint(0,10,size=10)
df = pd.DataFrame()
df["col1"] = col1
df["result"] = [0]*len(df)
#slow computation
for i in range(len(df)):
    if i == 0:
        df["result"][i] = np.nan
    else:
        df["result"][i] = np.log(df["col1"][i]/df["col1"][i-1])
However, since I have a large sample this takes a lot of time to compute. Is there a way to improve the code in order to make it faster?
Use Series.shift on the col1 column to get the previous row's value, and Series.div for the division:
df["result1"] = np.log(df["col1"].div(df["col1"].shift()))
#alternative
#df["result1"] = np.log(df["col1"] / df["col1"].shift())
print (df)
col1 result result1
0 5 NaN NaN
1 0 -inf -inf
2 3 inf inf
3 3 0.000000 0.000000
4 7 0.847298 0.847298
5 9 0.251314 0.251314
6 3 -1.098612 -1.098612
7 5 0.510826 0.510826
8 2 -0.916291 -0.916291
9 4 0.693147 0.693147
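Since log(p_t / p_{t-1}) = log(p_t) - log(p_{t-1}), the same log return can also be written as a difference of logs. The sketch below checks that identity on a small, made-up price series (the values and the name clean_price are hypothetical); it assumes strictly positive prices, so no inf values appear:

```python
import numpy as np
import pandas as pd

# Hypothetical clean prices, strictly positive.
prices = pd.Series([100.0, 101.5, 99.8, 100.2], name="clean_price")

# Ratio form, as in the answer above.
via_shift = np.log(prices / prices.shift())

# Difference-of-logs form; mathematically the same quantity.
via_diff = np.log(prices).diff()

print(pd.DataFrame({"via_shift": via_shift, "via_diff": via_diff}))
```

In both forms the first row is NaN because there is no previous price.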
I tested both solutions:
np.random.seed(0)
col1 = np.random.randint(0,10,size=10000)
df = pd.DataFrame({'col1':col1})
In [128]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
865 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [129]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
1.16 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
1.03 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(0)
col1 = np.random.randint(0,10,size=100000)
df = pd.DataFrame({'col1':col1})
In [132]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
3.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [133]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
6.31 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
3.75 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There is no need to combine several functions; use Series.pct_change():
df = df.assign(
result=lambda x: np.log(x.col1.pct_change() + 1)
)
print(df)
col1 result
0 3 NaN
1 5 0.510826
2 8 0.470004
3 7 -0.133531
4 9 0.251314
5 1 -2.197225
6 1 0.000000
7 2 0.693147
8 7 1.252763
9 0 -inf
This should be a much faster way to get the same results:
df["result_2"] = np.log(df["col1"] / df["col1"].shift())
Given a pandas.DataFrame with a column holding mixed datatypes, e.g.
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
# mixed
# 0 2020-10-04 00:00:00
# 1 9990
# 2 a string
That does the trick but I was wondering if there was a more pandastic way of doing this?
One idea is to test whether the values are numeric with to_numeric and errors='coerce', then select the non-missing values:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Unfortunately it is slow. Timings for some other ideas:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the column holds mixed values, pandas cannot vectorize effectively here by default, so element-wise approaches are necessary.
You still need to call type:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
If you want to multiply all 'numbers', you can use the following.
Let's use pd.to_numeric with parameter errors = 'coerce' and fillna:
df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df
Output:
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Let's add a float to the column
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})
Using @BenYo's solution:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
Output (note only the integer 999 is multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 100.3
Using @jezrael's solution (this answer's solution behaves similarly):
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
Output (note all numbers are multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 1003
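If the goal is to multiply ints and floats but nothing else, a type-based mask is one option. The following is a sketch, not one of the solutions compared above: the helper is_real_number is hypothetical, and excluding bool is an assumption made here because bool is a subclass of int in Python:

```python
import numbers
import pandas as pd

df = pd.DataFrame(
    {"mixed": [pd.Timestamp("2020-10-04"), 999, "a string", 100.3, True]}
)

def is_real_number(value):
    """True for ints and floats, False for bools and everything else (hypothetical helper)."""
    return isinstance(value, numbers.Real) and not isinstance(value, bool)

# Element-wise type check; the column is object dtype, so this cannot be vectorized.
mask = df["mixed"].map(is_real_number)
df.loc[mask, "mixed"] = df.loc[mask, "mixed"] * 10
print(df)
```

With this mask, strings that merely look numeric are left untouched, because the check is on the Python type rather than on whether the value can be parsed as a number.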
If you do many calculations and can spare a little more memory, I suggest adding a column that records the type of each value in mixed. Once this column is built, the calculations are much faster.
Here's the code:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation
The repr of the mixed_type column is:
0 Timestamp
1 int
2 str
3 Timestamp
4 int
...
2999995 int
2999996 str
2999997 Timestamp
2999998 int
2999999 str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]
And here are the timings:
>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For DataFrames that are not very long, I can suggest this way as well:
df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))
Given the following data, where 3 means yes and 2 means no
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
which looks as
v_1 v_2 v_3
0 2 2 3
1 2 3 2
2 3 2 2
I would like to create the following series
0 v_3
1 v_2
2 v_1
All I can think of is the following:
t['V'] = t.sum().reset_index(drop=True)
which gives
v_1 v_2 v_3 V
0 v_3 v_1
1 v_2 v_2
2 v_1 v_3
I'm wondering if there's a nicer approach than this, or perhaps more general.
Perhaps this is what you need, to keep the 3s and concat them in a series?
(
t.apply(lambda x: np.where(x.eq(3), x.name, None))
.stack().reset_index(drop=True)
)
0 v_3
1 v_2
2 v_1
dtype: object
Give this a whirl:
(t
.stack()
.droplevel(0)
.loc[lambda x: x.eq(3)]
.reset_index(name='temp')
.drop('temp',axis=1)
)
index
0 v_3
1 v_2
2 v_1
Use DataFrame.where to replace values other than 3 with missing values, reshape with DataFrame.stack, remove the first level of the MultiIndex, and finally create a Series from the index. This is the option to pick if performance is important:
s = pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
#alternative
#s = pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
print (s)
0 v_3
1 v_2
2 v_1
dtype: object
Details:
print (t.where(t.eq(3)))
v_1 v_2 v_3
0 NaN NaN 3.0
1 NaN 3.0 NaN
2 3.0 NaN NaN
print (t.where(t.eq(3)).stack())
0 v_3 3.0
1 v_2 3.0
2 v_1 3.0
dtype: float64
print (t.where(t.eq(3)).stack().droplevel(0))
v_3 3.0
v_2 3.0
v_1 3.0
dtype: float64
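When every row contains exactly one 3, as in the example data, a shorter alternative is to ask for the column label of the first True in each row of the boolean mask. This is a sketch of a different technique (DataFrame.idxmax on the mask), not the where/stack approach above, and it relies on that one-per-row assumption:

```python
import pandas as pd

t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2]})

is_yes = t.eq(3)

# Guard the assumption: idxmax would silently return the first column
# for a row with no 3, and only the first match for a row with several.
if not is_yes.sum(axis=1).eq(1).all():
    raise ValueError("expected exactly one 3 per row")

# Column label of the first True in each row.
s = is_yes.idxmax(axis=1)
print(s)
```

For rows with zero or several 3s, the where/stack approach above is the safer choice, since it simply emits one entry per 3.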
Performance for 1k rows and 10 columns:
np.random.seed(123)
t = pd.DataFrame(np.random.choice([2,3], (1000, 10))).add_prefix('v_')
#print (t)
In [25]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
2.66 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
2.61 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
5.98 ms ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
3.48 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance for 100k rows and 10 columns:
t = pd.DataFrame(np.random.choice([2,3], (100000, 10))).add_prefix('v_')
print (t)
In [30]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
84.7 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
84.1 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
147 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
101 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can create a new index that has the location of 3 for each column. Then you apply that index to your column names.
import pandas as pd
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
index_list = [t[t[col]==3].index[0] for col in t.columns] # create new index
series = pd.Series(t.columns) # series of column names
series.index = index_list # apply index to column names
print(series.sort_index())
Suppose we have a table of customers and their spending.
import pandas as pd
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Bob", "Charles"],
"Spend": [3, 5, 7, 9]
})
LIMIT = 6
For each customer, we can compute the fraction of their Spend entries that are larger than LIMIT, using the apply method:
df.groupby("Name").apply(
lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)
Name
Alice 0.0
Bob 0.5
Charles 1.0
However, the apply method is just a loop, which is slow if there are many customers.
Question: Is there a faster way, which presumably uses vectorization?
As of version 0.23.4, SeriesGroupBy does not support comparison operators:
(df.groupby("Name")["Spend"] > LIMIT).mean()
TypeError: '>' not supported between instances of 'SeriesGroupBy' and 'int'
The code below results in a null value for Alice:
df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()
Name
Alice NaN
Bob 0.5
Charles 1.0
The code below gives the correct result, but it requires us to either modify the table, or make a copy to avoid modifying the original.
df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name")["Dummy"].sum() / df.groupby("Name").size()
Groupby does not use vectorization, but it has aggregate functions that are optimized with Cython.
You can take the mean:
(df["Spend"] > LIMIT).groupby(df["Name"]).mean()
df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
Or use div to replace NaN with 0:
df[df["Spend"] > LIMIT].groupby("Name").size() \
.div(df.groupby("Name").size(), fill_value = 0)
df["Spend"].gt(LIMIT).groupby(df["Name"]).sum() \
.div(df.groupby("Name").size(), fill_value = 0)
Each of the above will yield
Name
Alice 0.0
Bob 0.5
Charles 1.0
dtype: float64
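The mean works here because a boolean Series averages as 0s and 1s, so the group mean of Spend > LIMIT is exactly the fraction of rows above the limit. A small sketch on the question's data, which also checks the result against the apply-based version and confirms that no helper column is added to df:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9],
})
LIMIT = 6

# Boolean mask, grouped by the Name column of the original frame.
over_limit = df["Spend"] > LIMIT
fraction = over_limit.groupby(df["Name"]).mean()

# Slower reference computation with apply, for comparison only.
via_apply = df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())

print(fraction)
print((fraction == via_apply).all())
```

Grouping the mask by df["Name"] instead of by a column name is what avoids the temporary Dummy column from the question.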
Performance
The result depends on the number of rows and on how many rows pass the filter, so it is best to test on real data.
np.random.seed(123)
N = 100000
df = pd.DataFrame({
"Name": np.random.randint(1000, size = N),
"Spend": np.random.randint(10, size = N)
})
LIMIT = 6
In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
6.16 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value = 0)
6.35 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value = 0)
9.66 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# RafaelC comment solution
In [13]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).sum() / s.size)
400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())
328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This NumPy solution is vectorized, but a bit complicated:
In [15]: %%timeit
...: i, r = pd.factorize(df["Name"])
...: a = pd.Series(np.bincount(i), index = r)
...:
...: i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
...: b = pd.Series(np.bincount(i1), index = r1)
...:
...: df1 = b.div(a, fill_value = 0)
...:
5.05 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
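A variant of the same idea needs only one factorize call: np.bincount accepts a weights argument, so the per-group count of rows above the limit can be computed from the same integer codes. This is a sketch on the small example data, not a benchmarked replacement for the timing above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9],
})
LIMIT = 6

# Integer code per row and the unique names, in order of first appearance.
codes, names = pd.factorize(df["Name"])

# 1.0 where the row is above the limit, 0.0 otherwise.
over_limit = (df["Spend"].to_numpy() > LIMIT).astype(float)

group_sizes = np.bincount(codes, minlength=len(names))
hits = np.bincount(codes, weights=over_limit, minlength=len(names))

fraction = pd.Series(hits / group_sizes, index=names)
print(fraction)
```

Note that the result is ordered by first appearance of each name rather than sorted, unlike the groupby output.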
I have a pandas DataFrame with a column (ip) holding certain values, and a separate pandas Series, not part of this DataFrame, holding a collection of these values. I want to create a column in the DataFrame that is 1 if a given row has its ip in my pandas Series (blacklist).
import pandas as pd
dict = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}
df = pd.DataFrame(dict)
df
ip os
0 103022 23
1 114221 19
2 47902 17
3 23550 13
4 84644 19
blacklist = pd.Series([103022, 23550])
blacklist
0 103022
1 23550
My question is: how can I create a new column in df such that it shows 1 when the given ip is in the blacklist and 0 otherwise?
Sorry if this is too dumb, I'm still new to programming. Thanks a lot in advance!
Use isin with astype:
df['new'] = df['ip'].isin(blacklist).astype(np.int8)
It is also possible to convert the column to a categorical:
df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
print (df)
ip os new
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
For interest: in a large DataFrame, converting to Categorical does not save memory:
df = pd.concat([df] * 10000, ignore_index=True)
df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index 80
ip 400000
os 400000
new1 50096
new2 50000
new3 50000
dtype: int64
Timings:
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100
In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Slow, but simple and readable method:
Another way to do this would be to create your new column using a list comprehension that assigns 1 if your ip value is in blacklist and 0 otherwise:
df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
>>> df
ip os new_column
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
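Part of the cost of the list comprehension is that x in blacklist.values scans the whole array for every row. A sketch of the same readable approach with a Python set, which gives constant-time membership tests on average; the variable name blacklist_set is an assumption introduced here:

```python
import pandas as pd

df = pd.DataFrame({
    "ip": [103022, 114221, 47902, 23550, 84644],
    "os": [23, 19, 17, 13, 19],
})
blacklist = pd.Series([103022, 23550])

# Build the set once, outside the comprehension.
blacklist_set = set(blacklist.tolist())

df["new_column"] = [1 if ip in blacklist_set else 0 for ip in df["ip"]]
print(df)
```

Also note that writing x in blacklist directly would test the Series index labels rather than its values, which is why .values (or a set, as here) is needed.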
EDIT: Faster method building on Categorical: If you want to maximize speed, the following would be quite fast, though not quite as fast as the non-categorical .isin method. It builds on the use of pd.Categorical as suggested by @jezrael, but leverages its capacity for assigning categories:
df['new_column'] = pd.Categorical(df['ip'],
categories = blacklist.unique()).notnull().astype(int)
Timings:
import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'], categories = \
blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)