Is there any efficient way to concatenate a Pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
from io import StringIO
# test data
df = pd.read_csv(StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow down the transformation on a large dataset with lots of columns.
Is there a better way to prefix column names to values?
Here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
dt = [m[c].apply(lambda x:f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
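For reference, the fastest method above, applied back to the original small df, is just a one-liner (a minimal sketch; the output matches the earlier examples):
df.astype(str).radd(df.columns + '_')
#               date     value        data
# 0  date_01/01/2019  value_30  data_data1
# 1  date_01/01/2019  value_40  data_data2
# 2  date_02/01/2019  value_20  data_data1
# 3  date_02/01/2019  value_10  data_data2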
numpy.core.defchararray.add
from numpy.core.defchararray import add  # np.char.add is the public alias in newer NumPy versions
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
[[f'{k}_{v}' for k, v in zip(a, t)]
for t in zip(*map(df.get, a))],
df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.
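For example, applied to the small df from the question (a quick sketch using the same column names), it produces the same prefixed output as the answers above:
result = pd.DataFrame({col: col + "_" + df[col].astype(str) for col in df.columns})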
Related
I have a large DataFrame, and I want to add a Series to every row of it.
The following is my current way of achieving this:
print(df.shape) # (31676, 3562)
diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
for i in range(len(df)):
    df.iloc[i] = df.iloc[i] + diff
This method takes a lot of time. Is there a more efficient way of doing this?
You can subtract the Series for a vectorized operation; subtracting a numpy array should be a bit faster:
np.random.seed(2022)
df = pd.DataFrame(np.random.rand(1000,1000))
In [51]: %timeit df.add(4.1).sub(df.iloc[0])
7.99 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df + 4.1 - df.iloc[0]
8.46 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit df + 4.1 - df.iloc[0].to_numpy()
7.59 ms ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [49]: %%timeit
...: diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
...: for i in range(len(df)):
...:     df.iloc[i] = df.iloc[i] + diff
...:
433 ms ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
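Putting it together, a minimal sketch of the vectorized replacement for the original loop (equivalent, since diff is computed once before any row is modified):
diff = 4.1 - df.iloc[0]
df = df + diff  # pandas aligns diff on the columns and broadcasts it to every row
# or, skipping the intermediate Series:
# df = df + 4.1 - df.iloc[0].to_numpy()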
I have a pandas dataframe with two columns, where I need to check whether the value at each row of column A is a string that starts with the value of the corresponding row of column B, or vice versa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as either of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.
Is there any more beautiful or efficient method to do this?
One idea is to use np.vectorize, but because it works with strings, performance is only a bit better than your solution:
def fun(a, b):
    return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Given a pandas.DataFrame with a column holding mixed datatypes, like e.g.
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
I was wondering how to obtain the datatypes of the individual objects in the column (Series). Suppose I want to modify all entries in the Series that are of a certain type, e.g. multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
# mixed
# 0 2020-10-04 00:00:00
# 1 9990
# 2 a string
That does the trick but I was wondering if there was a more pandastic way of doing this?
One idea is to test for numeric values with to_numeric using errors='coerce', and then check for non-missing values:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Unfortunately it is slow; some other ideas:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pandas cannot use vectorization effectively here because of the mixed values, so element-wise approaches are necessary.
You still need to call type:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
If you want to multiply all 'numbers', you can use the following.
Let's use pd.to_numeric with the parameter errors='coerce' and fillna:
df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df
Output:
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Let's add a float to the column
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})
Using #BenYo:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
Output (note only the integer 999 is multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 100.3
Using #jezrael's, and similarly this solution:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
Output (note all numbers are multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 1003
If you do many calculations and have a little more memory, I suggest adding a column to indicate the type of the mixed values, for better efficiency. After you construct this column, the calculations are much faster.
Here's the code:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation
The mixed_type column repr is:
0 Timestamp
1 int
2 str
3 Timestamp
4 int
...
2999995 int
2999996 str
2999997 Timestamp
2999998 int
2999999 str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]
And here's the timeit:
>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
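While the helper column exists (i.e. before the del above), it can be reused for other element-wise updates without re-deriving the types; a purely illustrative sketch:
m_str = df.mixed_type == 'str'
df.loc[m_str, 'mixed'] += '!'  # e.g. tweak every string entry using the cached type info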
For DataFrames that are not very long, I can suggest this way as well:
df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))
Data
pb = {"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222"},"mark_up":{"0":1.2987,"1":1.5625,"2":1.3698,"3":1.3333,"4":1.4589}}
data = {"id":{"0":"K69","1":"K70","2":"K71","3":"K72","4":"K73","5":"K74","6":"K75","7":"K79","8":"K86","9":"K100"},"cost":{"0":29.74,"1":9.42,"2":9.42,"3":9.42,"4":9.48,"5":9.48,"6":24.36,"7":5.16,"8":9.8,"9":3.28},"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222","5":"333","6":"444","7":"555","8":"666","9":"777"}}
pb = pd.DataFrame(data=pb).set_index('mark_up_id')
df = pd.DataFrame(data=data)
Expected Output
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
Or..
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
Question
Is there a function that does the above, or is this the best way to go about this type of operation?
I would use the second solution you propose, or better, this:
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
I think using update can be awkward because it doesn't return anything.
Let's say Series.fillna is more flexible.
We can also use DataFrame.assign
in order to continue working on the DataFrame that the assignment returns.
df.assign( Cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']) )
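For example, a sketch of chaining further work onto the frame that assign returns (df and pb as defined in the question; the final filter is purely illustrative):
out = (
    df.assign(cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']))
      .query('cost > 10')
)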
Time comparison with the join method:
%%timeit
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
#945 µs ± 46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
#3.59 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
slow..
%%timeit
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
#985 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, I recommend you see: Understanding inplace and When should I use apply
I have a Pandas DataFrame with a column (ip) with certain values and another Pandas Series, not in this DataFrame, with a collection of these values. I want to create a column in the DataFrame that is 1 if a given row has its ip in my Pandas Series (blacklist).
import pandas as pd
dict = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}
df = pd.DataFrame(dict)
df
ip os
0 103022 23
1 114221 19
2 47902 17
3 23550 13
4 84644 19
blacklist = pd.Series([103022, 23550])
blacklist
0 103022
1 23550
My question is: how can I create a new column in df such that it shows 1 when the given ip is in the blacklist and 0 otherwise?
Sorry if this is too dumb; I'm still new to programming. Thanks a lot in advance!
Use isin with astype:
df['new'] = df['ip'].isin(blacklist).astype(np.int8)
It is also possible to convert the column to categorical:
df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
print (df)
ip os new
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
Interestingly, for a large DataFrame, converting to Categorical does not save memory:
df = pd.concat([df] * 10000, ignore_index=True)
df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index 80
ip 400000
os 400000
new1 50096
new2 50000
new3 50000
dtype: int64
Timings:
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100
In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Slow, but simple and readable method:
Another way to do this would be to create your new column using a list comprehension, set to assign a 1 if your ip value is in blacklist and a 0 otherwise:
df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
>>> df
ip os new_column
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
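A small tweak that is not in the original answer: building a Python set from the blacklist once should speed the comprehension up considerably, because each membership test becomes O(1) instead of scanning the underlying numpy array on every row:
blacklist_set = set(blacklist.values)
df['new_column'] = [1 if x in blacklist_set else 0 for x in df.ip]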
EDIT: Faster method building on Categorical: If you want to maximize speed, the following would be quite fast, though not quite as fast as the non-categorical .isin method. It builds on the use of pd.Categorical as suggested by #jezrael, but leverages its capacity for assigning categories:
df['new_column'] = pd.Categorical(df['ip'],
categories = blacklist.unique()).notnull().astype(int)
Timings:
import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'], categories = \
blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)