How to add a string in the middle of a column in pandas - python

I'm looking for a solution that doesn't involve an .apply or lambda function that loops through the list and adds the string at the desired index. I have a column like this with many entries:
df = pd.DataFrame(["1:77631829:-:1:77641672:-"], columns=["position"])
position
0 1:77631829:-:1:77641672:-
I'd like:
position
0 chr1:77631829:-:chr1:77641672:-
So I'd like to insert "chr" at the beginning and after the third colon (:).
I would have thought something like this would work, but insert hasn't been implemented for Series:
"chr" + df["position"].str.split(":").insert(3, "chr").str.join(":")
This does it, but looks inefficient:
"chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")

I think you can split on the first three occurrences of :, unpack the head and tail of the resulting list, join the head back together with :, prepend "chr" to both parts, and append the result to a list L:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
print (df)
position
0 1:77631829:-:1:77641672:-
1 1:77631829:-:1:77641672:-
L = []
for x in df["position"]:
    *i, j = x.split(':', 3)
    L.append("chr" + ':'.join(i) + ":chr" + j)
df['new'] = L
print (df)
                     position                              new
0  1:77631829:-:1:77641672:-  chr1:77631829:-:chr1:77641672:-
1  1:77631829:-:1:77641672:-  chr1:77631829:-:chr1:77641672:-
A hack solution with str.replace:
'chr' + df['position'].str.replace('-:', '-:chr')
Faster, with a list comprehension and f-strings:
df['new'] = [f"chr{x.replace('-:', '-:chr')}" for x in df['position']]
Performance:
df = pd.DataFrame(["1:77631829:-:1:77641672:-","1:77631829:-:1:77641672:-"],
columns=["position"])
#[20000 rows x 1 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [226]: %%timeit
...: L = []
...: for x in df["position"]:
...:     *i, j = x.split(':', 3)
...:     L.append("chr" + ':'.join(i) + ":chr" + j)
...:
...: df['new1'] = L
...:
18.9 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [227]: %%timeit
...: df['new2'] = "chr" + df["position"].str.split(":").str[:3].str.join(":") + "chr" + df["position"].str.split(":").str[3:].str.join(":")
...:
50.8 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [228]: %%timeit
...: df['new3'] = 'chr' + df['position'].str.replace('-:', '-:chr')
...:
21.5 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [229]: %%timeit
...: df['new4'] = [f"chr{x.replace('-:', '-:chr')}" for x in df['position']]
...:
8.59 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
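As a quick sanity check (a minimal sketch using the single-row frame from the question), the str.replace-based approach reproduces the desired value:
import pandas as pd

df = pd.DataFrame(["1:77631829:-:1:77641672:-"], columns=["position"])
out = "chr" + df["position"].str.replace("-:", "-:chr", regex=False)  # literal replacement
print(out.iloc[0])
# chr1:77631829:-:chr1:77641672:-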

Related

Best way to execute multiple lines of pandas in parallel? (Speed up)

Basically, I am performing a simple operation and updating 100 columns of my dataframe of size 550 rows x 2700 columns.
I am updating the 100 columns like this:
df["col1"] = (df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100
This operation is taking 170 ms on my original dataframe. I want to speed it up; I am doing some real-time processing, so time is important.
You can select only the needed columns with the list cols, subtract from the right side with DataFrame.rsub, divide with DataFrame.div, and multiply by 100 with DataFrame.mul:
cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)
Performance:
np.random.seed(2022)
df=pd.DataFrame(np.random.randint(1001, size=(550,2700))).add_prefix('col')
df = df.rename(columns={'col0':'static'})
In [58]: %%timeit
...: for i in range(1, 101):
...: df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
...:
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [59]: %%timeit
...: cols = [f'col{c}' for c in range(1, 101)]
...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)
...:
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
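A small correctness sketch (assumed setup mirroring the benchmark above, except that values are drawn from 1-1000 so no column contains zeros) to confirm the vectorized version matches the loop:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.randint(1, 1001, size=(550, 2700))).add_prefix('col')
df = df.rename(columns={'col0': 'static'})
cols = [f'col{c}' for c in range(1, 101)]

# loop formula from the question, collected per column
looped = {c: (df["static"] - df[c]) / df[c] * 100 for c in cols}
# vectorized version from the answer
vectorized = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)

assert all(np.allclose(vectorized[c], looped[c]) for c in cols)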

Is there a way to optimize my list comprehension for better performance? It is slower than a for loop

I am trying to optimize my code for looping through an ASC raster file. The input to the function is the data array from the ASC file with shape 1,000 x 1,000 (1 million data points), the ASC file information, and a column-skipping value. The skip value is not important in this case.
My function with a for loop performs decently and skips an array cell if its value equals nodata_value. Here is the function:
def asc_process_single(self, asc_array, asc_info, skip=1):
    # ncols = asc_info['ncols']
    nrows = asc_info['nrows']
    xllcornor = asc_info['xllcornor']
    yllcornor = asc_info['yllcornor']
    cellsize = asc_info['cellsize']
    nodata_value = asc_info['nodata_value']
    raster_size_y = cellsize*nrows
    # raster_size_x = cellsize*ncols
    # Looping over array rows and cols with skipping
    xyz = []
    for row in range(asc_array.shape[0])[::skip]:
        for col in range(asc_array.shape[1])[::skip]:
            val_z = asc_array[row, col]  # Z value of datapoint
            # The no data value is not processed
            if val_z == nodata_value:
                pass
            else:
                # Xcoordinate for current Z value
                val_x = xllcornor + (col * cellsize)
                # Ycoordinate for current Z value
                val_y = yllcornor + raster_size_y - (row * cellsize)
                # x, y, z to LIST
                xyz.append([val_x, val_y, val_z])
    return xyz
Timing this on an ASC file where nodata_value(s) are present gives:
593 ms ± 34.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
I thought I could do better with a list comprehension:
def asc_process_single_listcomprehension(self, asc_array, asc_info, skip=1):
    # ncols = asc_info['ncols']
    nrows = asc_info['nrows']
    xllcornor = asc_info['xllcornor']
    yllcornor = asc_info['yllcornor']
    cellsize = asc_info['cellsize']
    nodata_value = asc_info['nodata_value']
    raster_size_y = cellsize*nrows
    # raster_size_x = cellsize*ncols
    # Looping over array rows and cols with skipping
    rows = range(asc_array.shape[0])[::skip]
    cols = range(asc_array.shape[1])[::skip]
    xyz = [[xllcornor + (col * cellsize),
            yllcornor + raster_size_y - (row * cellsize),
            asc_array[row, col]]
           for row in rows for col in cols
           if asc_array[row, col] != nodata_value]
    return xyz
However, this performs slower than my for loop, and I am wondering why.
757 ms ± 58.4 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
Is it because the list comprehension looks up asc_array[row, col] twice? This operation alone costs
193 ns ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
versus just reusing the already looked-up value in my for loop:
51.2 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Doing this 1 million times adds up for the list comprehension.
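(Rough arithmetic from the timings above: the extra lookup costs about 193 ns - 51 ns ≈ 142 ns per cell, and 142 ns x 1,000,000 cells ≈ 142 ms, which is in the same ballpark as the 757 ms - 593 ms ≈ 164 ms gap between the two versions.)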
Any ideas on how to optimize my list comprehension so it performs better than my for loop? Any other ideas to improve performance?
EDIT:
Solution:
I tried the two proposals given:
Referencing the Z value in my list comprehension instead of looking it up in the array twice, which took longer.
Rewriting the function to handle the problem with numpy arrays.
I rewrote the list comprehension to this:
xyz = [[xllcornor + (col * cellsize),
        yllcornor + raster_size_y - (row * cellsize),
        val_z]
       for row in rows for col in cols for val_z in [asc_array[row, col]]
       if val_z != nodata_value]
and the numpy function became this:
def asc_process_numpy_single(self, asc_array, asc_info, skip):
    # ncols = asc_info['ncols']
    nrows = asc_info['nrows']
    xllcornor = asc_info['xllcornor']
    yllcornor = asc_info['yllcornor']
    cellsize = asc_info['cellsize']
    nodata_value = asc_info['nodata_value']
    raster_size_y = cellsize*nrows
    # raster_size_x = cellsize*ncols
    rows = np.arange(0, asc_array.shape[0], skip)[:, np.newaxis]
    cols = np.arange(0, asc_array.shape[1], skip)
    x = np.zeros((len(rows), len(cols))) + xllcornor + (cols * cellsize)
    y = np.zeros((len(rows), len(cols))) + yllcornor + raster_size_y - (rows * cellsize)
    z = asc_array[::skip, ::skip]
    xyz = np.asarray([x, y, z]).T.transpose((1, 0, 2)).reshape(
        (int(len(rows) * len(cols)), 3))
    mask = (xyz[:, 2] != nodata_value)
    xyz = xyz[mask]
    return xyz
I added the mask in the last two lines of the numpy function because I don't want the nodata_values.
The performance is as follows, in this order: for loop, original list comprehension, suggested list comprehension, and suggested numpy function:
609 ms ± 44.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
706 ms ± 22 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
604 ms ± 21.5 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
70.4 ms ± 1.26 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
The optimized list comprehension is comparable to the for loop, but the numpy function speeds things up by a factor of about 9.
Thank you so much for your comments and suggestions. I learned a lot today.
Yes, accessing your array twice increases the calculation time. Here are my test cases:
def funLoop(A):
    xyz = []
    for row in range(A.shape[0]):
        for col in range(A.shape[1]):
            xyz.append([col, row, A[row, col]])

def funListComp1(A):
    xyz = [[col, row, A[row, col]]
           for row in range(A.shape[0]) for col in range(A.shape[1])]

def funListComp2(A):
    xyz = [[col, A[row, col], A[row, col]]
           for row in range(A.shape[0]) for col in range(A.shape[1])]
A = np.random.rand(1000,1000)
%timeit funLoop(A)
%timeit funListComp1(A)
%timeit funListComp2(A)
457 ms ± 70.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
378 ms ± 8.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
779 ms ± 309 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With respect to large data you should always prefer using numpy instead of python for-loops. In your case the numpy code would look somewhat like:
def asc_process_single_numpy(asc_array):
    nodata_value = np.nan
    raster_size_y = 1
    skip = 2
    xllcornor = 0
    yllcornor = 0
    cellsize = 1
    rows = np.arange(0, asc_array.shape[0], skip)[:, np.newaxis]
    cols = np.arange(0, asc_array.shape[1], skip)
    # for row in rows for col in cols
    x = np.zeros((len(rows), len(cols))) + xllcornor + (cols * cellsize)
    y = np.zeros((len(rows), len(cols))) + yllcornor + raster_size_y - (rows * cellsize)
    z = asc_array[::skip, ::skip]
    return np.asarray([x, y, z]).T.transpose((1, 0, 2)).reshape((int(len(rows) * len(cols)), 3))
A = np.random.rand(1000,1000)
%timeit asc_process_single(A)
%timeit asc_process_single_listcomprehension(A)
%timeit asc_process_single_numpy(A)
183 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
210 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.3 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The only thing I can imagine that's slowing you down is that in the original code, you put asc_array[row, col] into a temporary variable, while in the list comprehension, you evaluate it twice.
There are two things you might want to try:
Assign the value to val_z in the "if" clause using a walrus operator (a minimal sketch of this follows below), or
Add for val_z in [asc_array[row, col]] after the other two fors.
Good luck.
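A minimal sketch of the first suggestion, the walrus-operator variant (Python 3.8+), reusing rows, cols, and the other names from the question's list-comprehension function; the value bound in the if clause is then reused in the element expression:
xyz = [[xllcornor + (col * cellsize),
        yllcornor + raster_size_y - (row * cellsize),
        val_z]
       for row in rows for col in cols
       if (val_z := asc_array[row, col]) != nodata_value]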

Vectorized method for mapping a list from one Dataframe row to another Dataframe row

Given a dataframe df1 that maps names to ids:
           id
names
a      535159
b      248909
c      548731
d      362555
e      398829
f      688939
g      674128
and a second dataframe df2 which contains lists of names:
names foo
0 [a, b, c] 9
1 [d, e] 16
2 [f] 2
3 [g] 3
What would be a vectorized method for retrieving the ids from df1 for each list item in each row, like this?
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
This is a working method to achieve the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
df2 = df2.apply(with_apply, axis=1)
I think vectorizing this is really hard; one idea for improving performance is to map by a dictionary. The solution uses if y in d so it still works when there is no match in the dictionary:
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
Test for 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One way using operator.itemgetter:
from operator import itemgetter

def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]

d = df1.set_index("names")["id"]
df2["ids"] = df2["names"].apply(listgetter)
Output:
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
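The isinstance check in listgetter is needed because itemgetter returns a bare value for a single key but a tuple for several keys; a small illustration, using two of the mock ids from the example above:
from operator import itemgetter

d = {'a': 535159, 'b': 248909}
print(itemgetter('a')(d))       # 535159 -> a single value, so wrap it in a list
print(itemgetter('a', 'b')(d))  # (535159, 248909) -> a tuple, so convert with list()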
Benchmark on 100k rows:
d = df1.set_index("names")["id"] # Common item
df2 = pd.concat([df2] * 25000, ignore_index=True)
%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2["ids2"] = df2["names"].apply(listgetter)
# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this seems to work:
df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])
I'm interested to know if this is the best approach.

Concatenate Pandas column name to column value

Is there an efficient way to concatenate a Pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
# test data
df = pd.read_csv(pd.compat.StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow the transformation down on a large dataset with lots of columns.
Is there are better way to prefix columns name to values?
here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
...: dt = [m[c].apply(lambda x:f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.

pandas mean values with where condition

I would like to calculate the mean of age excluding the value 99. In real life the dataframe is much bigger, and I have other possible variables.
Is there a more efficient way (faster or more elegant) to do it? Maybe with a pivot table or group by, or a function?
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
A numpy solution is faster - first create the array, then filter:
arr = df['age'].values
not99 = arr != 99
mean_for_age = arr[not99].mean()
But if you need a general solution that can also select another column, use your original approach:
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
mean_for_age = df.loc[not99, 'another col'].mean()
Timings (this depends on the data, so it's best to test with real data):
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
df = pd.concat([df] * 10000, ignore_index=True)
In [14]: %%timeit
...: arr = df['age'].values
...: not99 = arr != 99
...:
...: mean_for_age = arr[not99].mean()
...:
496 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %%timeit
...: not99 = df['age'] != 99
...: mean_for_age = df.loc[not99, 'age'].mean()
...:
1.82 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %%timeit
...: df.query("age != 99")['age'].mean()
...:
4.26 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
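The question also mentions group by; if there were an additional grouping variable, the same filtering idea carries over. A hedged sketch with a hypothetical 'group' column (not part of the original data):
data = {'age': [99, 45, 34, 32, 34, 67], 'group': ['a', 'a', 'a', 'b', 'b', 'b']}
df = pd.DataFrame(data)
# drop the sentinel value first, then compute the per-group mean
mean_per_group = df[df['age'] != 99].groupby('group')['age'].mean()
print(mean_per_group)
# group
# a    39.500000
# b    44.333333
# Name: age, dtype: float64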
